
Dynamic Programming and Optimal Control

3rd Edition, Volume II


by
Dimitri P. Bertsekas
Massachusetts Institute of Technology
Chapter 6
Approximate Dynamic Programming
This is an updated version of the research-oriented Chapter 6 on
Approximate Dynamic Programming. It will be periodically updated as
new research becomes available, and will replace the current Chapter 6 in
the book's next printing.
In addition to editorial revisions, rearrangements, and new exercises,
the chapter includes an account of new research, which is collected mostly
in Sections 6.3 and 6.8. Furthermore, a lot of new material has been
added, such as an account of post-decision state simplifications (Section
6.1), regression-based TD methods (Section 6.3), feature scaling (Section
6.3), policy oscillations (Section 6.3), λ-policy iteration and exploration-
enhanced TD methods, aggregation methods (Section 6.4), new Q-learning
algorithms (Section 6.5), and Monte Carlo linear algebra (Section 6.8).
This chapter represents work in progress. It more than likely con-
tains errors (hopefully not serious ones). Furthermore, its references to the
literature are incomplete. Your comments and suggestions to the author
at dimitrib@mit.edu are welcome. The date of last revision is given below.
September 9, 2011
6
Approximate
Dynamic Programming
Contents
6.1. General Issues of Cost Approximation . . . . . . . . p. 327
6.1.1. Approximation Architectures . . . . . . . . . p. 327
6.1.2. Approximate Policy Iteration . . . . . . . . . p. 332
6.1.3. Direct and Indirect Approximation . . . . . . p. 337
6.1.4. Simplifications . . . . . . . . . . . . . . p. 339
6.1.5. Monte Carlo Simulation . . . . . . . . . . . p. 345
6.1.6. Contraction Mappings and Simulation . . . . . p. 348
6.2. Direct Policy Evaluation - Gradient Methods . . . . . p. 351
6.3. Projected Equation Methods . . . . . . . . . . . . p. 357
6.3.1. The Projected Bellman Equation . . . . . . . p. 358
6.3.2. Projected Value Iteration - Other Iterative Methods . . p. 363
6.3.3. Simulation-Based Methods . . . . . . . . . . p. 367
6.3.4. LSTD, LSPE, and TD(0) Methods . . . . . . p. 369
6.3.5. Optimistic Versions . . . . . . . . . . . . . p. 380
6.3.6. Multistep Simulation-Based Methods . . . . . p. 381
6.3.7. Policy Iteration Issues - Exploration . . . . . . p. 394
6.3.8. Policy Oscillations - Chattering . . . . . . . . p. 403
6.3.9. λ-Policy Iteration . . . . . . . . . . . . . . p. 414
6.3.10. A Synopsis . . . . . . . . . . . . . . . . p. 420
6.4. Aggregation Methods . . . . . . . . . . . . . . . p. 425
6.4.1. Cost Approximation via the Aggregate Problem . p. 428
6.4.2. Cost Approximation via the Enlarged Problem . p. 431
6.5. Q-Learning . . . . . . . . . . . . . . . . . . . . p. 440
6.5.1. Convergence Properties of Q-Learning . . . . . p. 443
6.5.2. Q-Learning and Approximate Policy Iteration . . p. 447
6.5.3. Q-Learning for Optimal Stopping Problems . . . p. 450
6.5.4. Finite Horizon Q-Learning . . . . . . . . . . p. 455
6.6. Stochastic Shortest Path Problems . . . . . . . . . p. 458
6.7. Average Cost Problems . . . . . . . . . . . . . . p. 462
6.7.1. Approximate Policy Evaluation . . . . . . . . p. 462
6.7.2. Approximate Policy Iteration . . . . . . . . . p. 471
6.7.3. Q-Learning for Average Cost Problems . . . . . p. 474
6.8. Simulation-Based Solution of Large Systems . . . . . p. 477
6.8.1. Projected Equations - Simulation-Based Versions p. 479
6.8.2. Matrix Inversion and Regression-Type Methods . p. 484
6.8.3. Iterative/LSPE-Type Methods . . . . . . . . p. 486
6.8.4. Multistep Methods . . . . . . . . . . . . . p. 493
6.8.5. Extension of Q-Learning for Optimal Stopping . p. 496
6.8.6. Bellman Equation Error-Type Methods . . . . p. 498
6.8.7. Oblique Projections . . . . . . . . . . . . . p. 503
6.8.8. Generalized Aggregation by Simulation . . . . . p. 504
6.9. Approximation in Policy Space . . . . . . . . . . . p. 509
6.9.1. The Gradient Formula . . . . . . . . . . . . p. 510
6.9.2. Computing the Gradient by Simulation . . . . p. 511
6.9.3. Essential Features of Critics . . . . . . . . . p. 513
6.9.4. Approximations in Policy and Value Space . . . p. 515
6.10. Notes, Sources, and Exercises . . . . . . . . . . . p. 516
References . . . . . . . . . . . . . . . . . . . . . . p. 539
In this chapter we consider approximation methods for challenging, compu-
tationally intensive DP problems. We discussed a number of such methods
in Chapter 6 of Vol. I and Chapter 1 of the present volume, such as for
example rollout and other one-step lookahead approaches. Here our focus
will be on algorithms that are mostly patterned after two principal methods
of infinite horizon DP: policy and value iteration. These algorithms form
the core of a methodology known by various names, such as approximate
dynamic programming, or neuro-dynamic programming, or reinforcement
learning.
A principal aim of the methods of this chapter is to address problems
with a very large number of states n. In such problems, ordinary linear
algebra operations, such as n-dimensional inner products, are prohibitively
time-consuming, and indeed it may be impossible to even store an n-vector
in a computer memory. Our methods will involve linear algebra operations
of dimension much smaller than n, and require only that the components
of n-vectors are just generated when needed rather than stored.
Another aim of the methods of this chapter is to address model-free
situations, i.e., problems where a mathematical model is unavailable or
hard to construct. Instead, the system and cost structure may be sim-
ulated (think, for example, of a queueing network with complicated but
well-defined service disciplines at the queues). The assumption here is that
there is a computer program that simulates, for a given control u, the prob-
abilistic transitions from any given state i to a successor state j according
to the transition probabilities p_ij(u), and also generates a corresponding transition cost g(i, u, j).
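To make this concrete, the following minimal Python sketch shows the kind of interface such a simulator provides; the two-state transition table and cost function below are purely illustrative stand-ins for a complicated but well-defined simulation program.

import random

# Model-free simulator: given a state-control pair (i, u), return a successor
# state j drawn according to p_ij(u), together with the transition cost
# g(i, u, j). The caller never manipulates the transition probabilities.
def simulate_transition(i, u, p, g):
    successors = range(len(p[i][u]))
    j = random.choices(successors, weights=p[i][u])[0]  # sample j ~ p_ij(u)
    return j, g(i, u, j)

# Tiny illustrative example with 2 states and 2 controls.
p = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.2, 0.8], 1: [0.6, 0.4]}}
g = lambda i, u, j: 1.0 + 0.5 * u + (0.0 if j == i else 2.0)
j, cost = simulate_transition(0, 1, p, g)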
Given a simulator, it may be possible to use repeated simulation to
calculate (at least approximately) the transition probabilities of the system
and the expected stage costs by averaging, and then to apply the methods
discussed in earlier chapters. The methods of this chapter, however, are
geared towards an alternative possibility, which is much more attractive
when one is faced with a large and complex system, and one contemplates
approximations. Rather than estimate explicitly the transition probabil-
ities and costs, we will aim to approximate the cost function of a given
policy or even the optimal cost-to-go function by generating one or more
simulated system trajectories and associated costs, and by using some form
of least squares fit.
Implicit in the rationale of methods based on cost function approxi-
mation is of course the hypothesis that a more accurate cost-to-go approx-
imation will yield a better one-step or multistep lookahead policy. This
is a reasonable but by no means self-evident conjecture, and may in fact
not even be true in a given problem. In another type of method, which
we will discuss somewhat briefly, we use simulation in conjunction with a
gradient or other method to approximate directly an optimal policy with
a policy of a given parametric form. This type of method does not aim at
good cost function approximation through which a well-performing policy
may be obtained. Rather, it aims directly at finding a policy with good
performance.
Let us also mention two other approximate DP methods, which we
have discussed at various points in other parts of the book, but we will not
consider further: rollout algorithms (Sections 6.4, 6.5 of Vol. I, and Section
1.3.5 of Vol. II), and approximate linear programming (Section 1.3.4).
Our main focus will be on two types of methods: policy evaluation al-
gorithms, which deal with approximation of the cost of a single policy (and
can also be embedded within a policy iteration scheme), and Q-learning
algorithms, which deal with approximation of the optimal cost. Let us
summarize each type of method, focusing for concreteness on the finite-state discounted case.
Policy Evaluation Algorithms
With this class of methods, we aim to approximate the cost function J_μ(i) of a policy μ with a parametric architecture of the form J̃(i, r), where r is a parameter vector (cf. Section 6.3.5 of Vol. I). This approximation
may be carried out repeatedly, for a sequence of policies, in the context
of a policy iteration scheme. Alternatively, it may be used to construct
an approximate cost-to-go function of a single suboptimal/heuristic policy,
which can be used in an on-line rollout scheme, with one-step or multistep
lookahead. We focus primarily on two types of methods.
In the first class of methods, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture J̃ to the samples through some least squares problem. This problem may be
solved by several possible algorithms, including linear least squares methods
based on simple matrix inversion. Gradient methods have also been used
extensively, and will be described in Section 6.2.
The second and currently more popular class of methods is called indirect. Here, we obtain r by solving an approximate version of Bellman's equation. We will focus exclusively on the case of a linear architecture, where J̃ is of the form Φr, and Φ is a matrix whose columns can be viewed as basis functions (cf. Section 6.3.5 of Vol. I). In an important method of this type, we obtain the parameter vector r by solving the equation

Φr = ΠT(Φr),    (6.1)

where Π denotes projection with respect to a suitable norm on the subspace of vectors of the form Φr, and T is either the mapping T_μ or a related mapping, which also has J_μ as its unique fixed point [here ΠT(Φr) denotes the projection of the vector T(Φr) on the subspace].

In another type of policy evaluation method, often called the Bellman equation error approach, which we will discuss briefly in Section 6.8.4, the parameter vector r is determined by minimizing a measure of error in satisfying Bellman's equation; for example, by minimizing over r

‖ J̃ − TJ̃ ‖,

where ‖·‖ is some norm. If ‖·‖ is a Euclidean norm, and J̃(i, r) is linear in r, this minimization is a linear least squares problem.
We can view Eq. (6.1) as a form of projected Bellman equation. We will show that for a special choice of the norm of the projection, ΠT is a contraction mapping, so the projected Bellman equation has a unique solution Φr*. We will discuss several iterative methods for finding r* in Section 6.3. All these methods use simulation and can be shown to converge under reasonable assumptions to r*, so they produce the same approximate cost function. However, they differ in their speed of convergence and in their suitability for various problem contexts. Here are the methods that we will focus on in Section 6.3 for discounted problems, and also in Sections 6.6-6.8 for other types of problems. They all depend on a parameter λ ∈ [0, 1], whose role will be discussed later.
(1) TD(λ) or temporal differences method. This algorithm may be viewed as a stochastic iterative method for solving a version of the projected equation (6.1) that depends on λ. The algorithm embodies important ideas and has played an important role in the development of the subject, but in practical terms, it is usually inferior to the next two methods, so it will be discussed in less detail.

(2) LSTD(λ) or least squares temporal differences method. This algorithm computes and solves a progressively more refined simulation-based approximation to the projected Bellman equation (6.1).

(3) LSPE(λ) or least squares policy evaluation method. This algorithm is based on the idea of executing value iteration within the lower dimensional space spanned by the basis functions. Conceptually, it has the form

Φr_{k+1} = ΠT(Φr_k) + simulation noise,    (6.2)

i.e., the current value iterate T(Φr_k) is projected on S and is suitably approximated by simulation. The simulation noise tends to 0 asymptotically, so assuming that ΠT is a contraction, the method converges to the solution of the projected Bellman equation (6.1). There are also a number of variants of LSPE(λ). Both LSPE(λ) and its variants have the same convergence rate as LSTD(λ), because they share a common bottleneck: the slow speed of simulation.

Another method of this type is based on aggregation (cf. Section 6.3.4 of Vol. I) and is discussed in Section 6.4. This approach can also be viewed as a problem approximation approach (cf. Section 6.3.3 of Vol. I): the original problem is approximated with a related aggregate problem, which is then solved exactly to yield a cost-to-go approximation for the original problem. The aggregation counterpart of the equation Φr = ΠT(Φr) has the form Φr = ΦDT(Φr), where Φ and D are matrices whose rows are restricted to be probability distributions (the aggregation and disaggregation probabilities, respectively).
Q-Learning Algorithms
With this class of methods, we aim to compute, without any approximation,
the optimal cost function (not just the cost function of a single policy). Q-
learning maintains and updates for each state-control pair (i, u) an estimate
of the expression that is minimized in the right-hand side of Bellmans
equation. This is called the Q-factor of the pair (i, u), and is denoted
by Q

(i, u). The Q-factors are updated with what may be viewed as a
simulation-based form of value iteration, as will be explained in Section
6.5. An important advantage of using Q-factors is that when they are
available, they can be used to obtain an optimal control at any state i
simply by minimizing Q

(i, u) over u U(i), so the transition probabilities


of the problem are not needed.
On the other hand, for problems with a large number of state-control
pairs, Q-learning is often impractical because there may be simply too
many Q-factors to update. As a result, the algorithm is primarily suitable
for systems with a small number of states (or for aggregated/few-state
versions of more complex systems). There are also algorithms that use
parametric approximations for the Q-factors (see Section 6.5), although
their theoretical basis is generally less solid.
Chapter Organization
Throughout this chapter, we will focus almost exclusively on perfect state
information problems, involving a Markov chain with a finite number of states i, transition probabilities p_ij(u), and single stage costs g(i, u, j). Extensions of many of the ideas to continuous state spaces are possible, but they are beyond our scope. We will consider first, in Sections 6.1-6.5, the
discounted problem using the notation of Section 1.3. Section 6.1 pro-
vides a broad overview of cost approximation architectures and their uses
in approximate policy iteration. Section 6.2 focuses on direct methods for
policy evaluation. Section 6.3 is a long section on a major class of indirect
methods for policy evaluation, which are based on the projected Bellman
equation. Section 6.4 discusses methods based on aggregation. Section 6.5
discusses Q-learning and its variations, and extends the projected Bellman
equation approach to the case of multiple policies, and particularly to opti-
mal stopping problems. Stochastic shortest path and average cost problems
are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 extends and
elaborates on the projected Bellman equation approach of Sections 6.3,
6.6, and 6.7, discusses another approach based on the Bellman equation
error, and generalizes the aggregation methodology. Section 6.9 describes
methods based on parametric approximation of policies rather than cost
functions.
6.1 GENERAL ISSUES OF COST APPROXIMATION
Most of the methodology of this chapter deals with approximation of some
type of cost function (optimal cost, cost of a policy, Q-factors, etc). The
purpose of this section is to highlight the main issues involved, without
getting too much into the mathematical details.
We start with general issues of parametric approximation architec-
tures, which we have also discussed in Vol. I (Section 6.3.5). We then
consider approximate policy iteration (Section 6.1.2), and the two general
approaches for approximate cost evaluation (direct and indirect; Section
6.1.3). In Section 6.1.4, we discuss various special structures that can be
exploited to simplify approximate policy iteration. In Sections 6.1.5 and
6.1.6 we provide orientation into the main mathematical issues underlying
the methodology, and focus on two of its main components: contraction
mappings and simulation.
6.1.1 Approximation Architectures
The major use of cost approximation is for obtaining a one-step lookahead
suboptimal policy (cf. Section 6.3 of Vol. I). In particular, suppose that we use J̃(j, r) as an approximation to the optimal cost of the finite-state discounted problem of Section 1.3. Here J̃ is a function of some chosen form (the approximation architecture) and r is a parameter/weight vector. Once r is determined, it yields a suboptimal control at any state i via the one-step lookahead minimization

μ̃(i) = arg min_{u∈U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃(j, r) ).    (6.3)
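As an illustration of the minimization (6.3), the following sketch assumes a small problem in which the transition probabilities p_ij(u), the costs g(i, u, j), the discount factor α, and the approximation J̃(·, r) are available as ordinary Python objects; all names are illustrative and not part of the original text.

def one_step_lookahead_control(i, U, p, g, alpha, J_tilde):
    # Return a control attaining the minimum in Eq. (6.3) at state i.
    # U(i): iterable of admissible controls; p[i][u][j] = p_ij(u).
    def lookahead_value(u):
        return sum(p[i][u][j] * (g(i, u, j) + alpha * J_tilde(j))
                   for j in range(len(p[i][u])))
    return min(U(i), key=lookahead_value)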
The degree of suboptimality of μ̃, as measured by ‖J_μ̃ − J*‖_∞, is bounded by a constant multiple of the approximation error according to

‖J_μ̃ − J*‖_∞ ≤ (2α / (1 − α)) ‖J̃ − J*‖_∞,

as shown in Prop. 1.3.7. This bound is qualitative in nature, as it tends to be quite conservative in practice.

We may also use a multiple-step lookahead minimization, with a cost-to-go approximation at the end of the multiple-step horizon. Conceptually, single-step and multiple-step lookahead approaches are similar, and the cost-to-go approximation algorithms of this chapter apply to both.
An alternative possibility is to obtain a parametric approximation Q̃(i, u, r) of the Q-factor of the pair (i, u), defined in terms of the optimal cost function J* as

Q*(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J*(j) ).

Since Q*(i, u) is the expression minimized in Bellman's equation, given the approximation Q̃(i, u, r), we can generate a suboptimal control at any state i via

μ̃(i) = arg min_{u∈U(i)} Q̃(i, u, r).

The advantage of using Q-factors is that in contrast with the minimization (6.3), the transition probabilities p_ij(u) are not needed in the above minimization. Thus Q-factors are better suited to the model-free context.
Note that we may similarly use approximations to the cost functions J_μ and Q-factors Q_μ(i, u) of specific policies μ. A major use of such approximations is in the context of an approximate policy iteration scheme; see Section 6.1.2.
The choice of architecture is very significant for the success of the approximation approach. One possibility is to use the linear form

J̃(i, r) = Σ_{k=1}^{s} r_k φ_k(i),    (6.4)
where r = (r_1, . . . , r_s) is the parameter vector, and φ_k(i) are some known scalars that depend on the state i. Thus, for each state i, the approximate cost J̃(i, r) is the inner product φ(i)'r of r and

φ(i) = ( φ_1(i), . . . , φ_s(i) )'.

We refer to φ(i) as the feature vector of i, and to its components as features (see Fig. 6.1.1). Thus the cost function is approximated by a vector in the subspace

S = { Φr | r ∈ ℜ^s },

where

Φ =  | φ_1(1)  . . .  φ_s(1) |     | φ(1)' |
     |   ⋮              ⋮    |  =  |   ⋮   |
     | φ_1(n)  . . .  φ_s(n) |     | φ(n)' |.
Figure 6.1.1 A linear feature-based architecture. It combines a mapping that extracts the feature vector φ(i) = ( φ_1(i), . . . , φ_s(i) )' associated with state i, and a parameter vector r to form a linear cost approximator φ(i)'r.
We can view the s columns of Φ as basis functions, and Φr as a linear combination of basis functions.
Features, when well-crafted, can capture the dominant nonlinearities
of the cost function, and their linear combination may work very well as an
approximation architecture. For example, in computer chess (Section 6.3.5
of Vol. I) where the state is the current board position, appropriate fea-
tures are material balance, piece mobility, king safety, and other positional
factors.
Example 6.1.1 (Polynomial Approximation)
An important example of linear cost approximation is based on polynomial
basis functions. Suppose that the state consists of q integer components x_1, . . . , x_q, each taking values within some limited range of integers. For example, in a queueing system, x_k may represent the number of customers in the kth queue, where k = 1, . . . , q. Suppose that we want to use an approximating function that is quadratic in the components x_k. Then we can define a total of 1 + q + q² basis functions that depend on the state x = (x_1, . . . , x_q) via

φ_0(x) = 1,    φ_k(x) = x_k,    φ_km(x) = x_k x_m,    k, m = 1, . . . , q.
A linear approximation architecture that uses these functions is given by

J̃(x, r) = r_0 + Σ_{k=1}^{q} r_k x_k + Σ_{k=1}^{q} Σ_{m=k}^{q} r_km x_k x_m,

where the parameter vector r has components r_0, r_k, and r_km, with k = 1, . . . , q, m = k, . . . , q. In fact, any kind of approximating function that is polynomial in the components x_1, . . . , x_q can be constructed similarly.
It is also possible to combine feature extraction with polynomial approximations. For example, the feature vector φ(i) = ( φ_1(i), . . . , φ_s(i) )' transformed by a quadratic polynomial mapping leads to approximating functions of the form

J̃(i, r) = r_0 + Σ_{k=1}^{s} r_k φ_k(i) + Σ_{k=1}^{s} Σ_{ℓ=1}^{s} r_kℓ φ_k(i) φ_ℓ(i),

where the parameter vector r has components r_0, r_k, and r_kℓ, with k, ℓ = 1, . . . , s. This function can be viewed as a linear cost approximation that uses the basis functions

w_0(i) = 1,    w_k(i) = φ_k(i),    w_kℓ(i) = φ_k(i) φ_ℓ(i),    k, ℓ = 1, . . . , s.
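The quadratic architecture of this example is straightforward to code; the sketch below follows the displayed sum with m ≥ k (hence it stores 1 + q + q(q+1)/2 coefficients) and is an illustration only, not text from the book.

import numpy as np

def quadratic_features(x):
    # phi_0(x) = 1, phi_k(x) = x_k, phi_km(x) = x_k * x_m for k <= m.
    x = np.asarray(x, dtype=float)
    q = len(x)
    feats = [1.0]
    feats.extend(x)
    feats.extend(x[k] * x[m] for k in range(q) for m in range(k, q))
    return np.array(feats)

def J_tilde(x, r):
    # Linear architecture: inner product of the feature vector with r.
    return quadratic_features(x).dot(r)

# Example with q = 2 queue lengths: 1 + 2 + 3 = 6 features.
r = np.zeros(6)
value = J_tilde([3, 5], r)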
Example 6.1.2 (Interpolation)
A common type of approximation of a function J is based on interpolation.
Here, a set I of special states is selected, and the parameter vector r has one
component r_i per state i ∈ I, which is the value of J at i:

r_i = J(i),    i ∈ I.

The value of J at states i ∉ I is approximated by some form of interpolation using r.

Interpolation may be based on geometric proximity. For a simple example that conveys the basic idea, let the system states be the integers within some interval, let I be a subset of special states, and for each state i let i̲ and ī be the states in I that are closest to i from below and from above. Then for any state i, J̃(i, r) is obtained by linear interpolation of the costs r_i̲ = J(i̲) and r_ī = J(ī):

J̃(i, r) = ((ī − i)/(ī − i̲)) r_i̲ + ((i − i̲)/(ī − i̲)) r_ī.

The scalars multiplying the components of r may be viewed as features, so the feature vector of i above consists of two nonzero features (the ones corresponding to i̲ and ī), with all other features being 0. Similar examples can be constructed for the case where the state space is a subset of a multidimensional space (see Example 6.3.13 of Vol. I).
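A small sketch of this interpolation architecture for integer states, assuming the special states of I are listed in increasing order and that i lies between the smallest and largest of them; the function and variable names are illustrative.

import bisect

def interpolated_value(i, special_states, r):
    # special_states: sorted list of the states in I; r[k] = J(special_states[k]).
    # Returns J_tilde(i, r) by linear interpolation between the closest special
    # states below and above i (and the exact value r[k] if i is in I).
    k = bisect.bisect_left(special_states, i)
    if k < len(special_states) and special_states[k] == i:
        return r[k]
    lo, hi = special_states[k - 1], special_states[k]
    w = (hi - i) / (hi - lo)      # weight of the lower special state
    return w * r[k - 1] + (1 - w) * r[k]

# Example: I = {0, 10, 20} with known costs at those states.
value = interpolated_value(4, [0, 10, 20], [1.0, 3.0, 2.0])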
A generalization of the preceding example is approximation based on
aggregation; see Section 6.3.4 of Vol. I and the subsequent Section 6.4 in
this chapter. There are also interesting nonlinear approximation architec-
tures, including those dened by neural networks, perhaps in combination
with feature extraction mappings (see Bertsekas and Tsitsiklis [BeT96], or
Sutton and Barto [SuB98] for further discussion). In this chapter, we will
mostly focus on the case of linear architectures, because many of the policy
evaluation algorithms of this chapter are valid only for that case.
We note that there has been considerable research on automatic ba-
sis function generation approaches (see e.g., Keller, Mannor, and Precup
[KMP06], and Jung and Polani [JuP07]). Moreover it is possible to use
standard basis functions which may be computed by simulation (perhaps
with simulation error). The following example discusses this possibility.
Example 6.1.3 (Krylov Subspace Generating Functions)
We have assumed so far that the columns of Φ, the basis functions, are known, and the rows φ(i)' of Φ are explicitly available to use in the various simulation-based formulas. We will now discuss a class of basis functions that may not be available, but may be approximated by simulation in the course of various algorithms. For concreteness, let us consider the evaluation of the cost vector

J_μ = (I − αP_μ)^{−1} g_μ

of a policy μ in a discounted MDP. Then J_μ has an expansion of the form

J_μ = Σ_{t=0}^{∞} α^t P_μ^t g_μ.

Thus g_μ, P_μ g_μ, . . . , P_μ^s g_μ yield an approximation based on the first s + 1 terms of the expansion, and seem suitable choices as basis functions. Also a more general expansion is

J_μ = J + Σ_{t=0}^{∞} α^t P_μ^t q,

where J is any vector in ℜ^n and q is the residual vector

q = T_μ J − J = g_μ + αP_μ J − J;

this can be seen from the equation J_μ − J = αP_μ(J_μ − J) + q. Thus the basis functions J, q, P_μ q, . . . , P_μ^{s−1} q yield an approximation based on the first s + 1 terms of the preceding expansion.

Generally, to implement various methods in subsequent sections with basis functions of the form P_μ^m g_μ, m ≥ 0, one would need to generate the ith components (P_μ^m g_μ)(i) for any given state i, but these may be hard to calculate. However, it turns out that one can use instead single sample approximations of (P_μ^m g_μ)(i), and rely on the averaging mechanism of simulation to improve the approximation process. The details of this are beyond our scope and we refer to the original sources (Bertsekas and Yu [BeY07], [BeY09]) for further discussion and specific implementations.
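For a problem small enough that P_μ and g_μ can be formed explicitly, these basis vectors can be generated directly, as in the sketch below; in the large-scale setting of this example the columns would instead be approximated by sampling, as discussed in the sources cited above.

import numpy as np

def krylov_basis(P, g, s):
    # Return the n x s matrix with columns g, P g, ..., P^{s-1} g.
    cols = [g]
    for _ in range(s - 1):
        cols.append(P @ cols[-1])
    return np.column_stack(cols)

# Tiny illustrative 3-state chain.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
g = np.array([1.0, 0.0, 2.0])
Phi = krylov_basis(P, g, s=3)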
We finally mention the possibility of optimal selection of basis functions within some restricted class. In particular, consider an approximation subspace

S_θ = { Φ(θ)r | r ∈ ℜ^s },

where the s columns of the n × s matrix Φ are basis functions parametrized by a vector θ. Assume that for a given θ, there is a corresponding vector r(θ), obtained using some algorithm, so that Φ(θ)r(θ) is an approximation of a cost function J (various such algorithms will be presented later in this chapter). Then we may wish to select θ so that some measure of approximation quality is optimized. For example, suppose that we can compute the true cost values J(i) (or more generally, approximations to these values) for a subset of selected states I. Then we may determine θ so that

Σ_{i∈I} ( J(i) − φ(i, θ)'r(θ) )²

is minimized, where φ(i, θ)' is the ith row of Φ(θ). Alternatively, we may determine θ so that the norm of the error in satisfying Bellman's equation,

‖ Φ(θ)r(θ) − T( Φ(θ)r(θ) ) ‖²,

is minimized. Gradient and random search algorithms for carrying out such minimizations have been proposed in the literature (see Menache, Mannor, and Shimkin [MMS06], and Yu and Bertsekas [YuB09]).
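As a schematic illustration of the last point, the sketch below performs a crude random search over θ, scoring each candidate by the first criterion above; the basis parametrization Phi_of(theta) and the inner solver r_of are hypothetical placeholders for whatever algorithm produces r(θ).

import numpy as np

def fit_error(theta, I, J_vals, Phi_of, r_of):
    # Sum over i in I of (J(i) - phi(i, theta)' r(theta))^2.
    Phi = Phi_of(theta)          # n x s matrix Phi(theta)
    r = r_of(Phi)                # r(theta) produced by some inner algorithm
    return sum((J_vals[i] - Phi[i].dot(r)) ** 2 for i in I)

def random_search(theta0, I, J_vals, Phi_of, r_of, iters=100, sigma=0.1):
    best, best_err = theta0, fit_error(theta0, I, J_vals, Phi_of, r_of)
    for _ in range(iters):
        cand = best + sigma * np.random.randn(*np.shape(theta0))
        err = fit_error(cand, I, J_vals, Phi_of, r_of)
        if err < best_err:
            best, best_err = cand, err
    return best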
6.1.2 Approximate Policy Iteration
Let us consider a form of approximate policy iteration, where we compute simulation-based approximations J̃(·, r) to the cost functions J_μ of stationary policies μ, and we use them to compute new policies based on (approximate) policy improvement. We impose no constraints on the approximation architecture, so J̃(i, r) may be linear or nonlinear in r.

Suppose that the current policy is μ, and for a given r, J̃(i, r) is an approximation of J_μ(i). We generate an improved policy μ̄ using the formula

μ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃(j, r) ),    for all i.    (6.5)

The method is illustrated in Fig. 6.1.2. Its theoretical basis was discussed in Section 1.3 (cf. Prop. 1.3.6), where it was shown that if the policy evaluation is accurate to within δ (in the sup-norm sense), then for an α-discounted problem, the method will yield in the limit (after infinitely many policy evaluations) a stationary policy that is optimal to within

2αδ / (1 − α)²,

where α is the discount factor. Experimental evidence indicates that this bound is usually conservative. Furthermore, often just a few policy evaluations are needed before the bound is attained.
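The overall scheme can be summarized in a few lines; evaluate_policy stands for any of the simulation-based evaluation methods of this chapter and improve_policy for the minimization (6.5), both left abstract in this sketch.

def approximate_policy_iteration(mu0, evaluate_policy, improve_policy,
                                 num_iterations=10):
    # mu0: initial policy guess.
    # evaluate_policy(mu): returns a parameter vector r with J_tilde(., r) ~ J_mu.
    # improve_policy(r): returns the policy obtained from the minimization (6.5).
    mu = mu0
    for _ in range(num_iterations):
        r = evaluate_policy(mu)     # approximate policy evaluation
        mu = improve_policy(r)      # policy improvement
    return mu, r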
Figure 6.1.2 Block diagram of approximate policy iteration: starting from an initial policy guess, the cycle alternates between approximate policy evaluation (evaluate the approximate cost J̃_μ(r) = Φr using simulation) and policy improvement (generate an improved policy μ̄).

When the sequence of policies obtained actually converges to some μ̄, then it can be proved that μ̄ is optimal to within

2αδ / (1 − α)

(see Section 6.3.8 and also Section 6.4.2, where it is shown that if policy evaluation is done using an aggregation approach, the generated sequence of policies does converge).
A simulation-based implementation of the algorithm is illustrated in
Fig. 6.1.3. It consists of four parts:
(a) The simulator, which given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.

(b) The decision generator, which generates the control μ̄(i) of the improved policy μ̄ at the current state i for use in the simulator.

(c) The cost-to-go approximator, which is the function J̃(j, r) that is used by the decision generator.

(d) The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation J̃(·, r̄) of the cost of μ̄.

Note that there are two policies μ and μ̄, and parameter vectors r and r̄, which are simultaneously involved in this algorithm. In particular, r corresponds to the current policy μ, and the approximation J̃(·, r) is used in the policy improvement Eq. (6.5) to generate the new policy μ̄. At the same time, μ̄ drives the simulation that generates samples to be used by the algorithm that determines the parameter r̄ corresponding to μ̄, which will be used in the next policy iteration.
Figure 6.1.3 Simulation-based implementation of the approximate policy iteration algorithm. Given the approximation J̃(i, r), we generate cost samples of the improved policy μ̄ by simulation (the decision generator module). We use these samples to generate the approximator J̃(i, r̄) of μ̄.

The Issue of Exploration

Let us note an important generic difficulty with simulation-based policy iteration: to evaluate a policy μ, we need to generate cost samples using that policy, but this biases the simulation by underrepresenting states that are unlikely to occur under μ. As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing potentially serious errors in the calculation of the improved control policy via the policy improvement Eq. (6.5).
The difficulty just described is known as inadequate exploration of the system's dynamics because of the use of a fixed policy. It is a particularly acute difficulty when the system is deterministic, or when the randomness embodied in the transition probabilities is relatively small. One possibility for guaranteeing adequate exploration of the state space is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset. A related approach, called iterative resampling, is to enrich the sampled set of states in evaluating the current policy μ as follows: derive an initial cost evaluation of μ, simulate the next policy μ̄ obtained on the basis of this initial evaluation to obtain a set of representative states S visited by μ̄, and repeat the evaluation of μ using additional trajectories initiated from S.

Still another frequently used approach is to artificially introduce some extra randomization in the simulation, by occasionally using a randomly generated transition rather than the one dictated by the policy (although this may not necessarily work because all admissible controls at a given state may produce similar successor states). This and other possibilities to improve exploration will be discussed further in Section 6.3.7.
Limited Sampling/Optimistic Policy Iteration
In the approximate policy iteration approach discussed so far, the policy evaluation of the cost of the improved policy must be fully carried out. An alternative, known as optimistic policy iteration, is to replace the policy μ with the policy μ̄ after only a few simulation samples have been processed, at the risk of J̃(·, r) being an inaccurate approximation of J_μ.

Optimistic policy iteration has been successfully used, among others, in an impressive backgammon application (Tesauro [Tes92]). However, the associated theoretical convergence properties are not fully understood. As will be illustrated by the discussion of Section 6.3.8 (see also Section 6.4.2 of [BeT96]), optimistic policy iteration can exhibit fascinating and counterintuitive behavior, including a natural tendency for a phenomenon called chattering, whereby the generated parameter sequence {r_k} converges, while the generated policy sequence oscillates because the limit of {r_k} corresponds to multiple policies.
We note that optimistic policy iteration tends to deal better with
the problem of exploration discussed earlier, because with rapid changes
of policy, there is less tendency to bias the simulation towards particular
states that are favored by any single policy.
Approximate Policy Iteration Based on Q-Factors
The approximate policy iteration method discussed so far relies on the calculation of the approximation J̃(·, r) to the cost function J_μ of the current policy μ, which is then used for policy improvement using the minimization

μ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃(j, r) ).

Carrying out this minimization requires knowledge of the transition probabilities p_ij(u) and calculation of the associated expected values for all controls u ∈ U(i) (otherwise a time-consuming simulation of these expected values is needed). A model-free alternative is to compute approximate Q-factors

Q̃(i, u, r) ≈ Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J_μ(j) ),    (6.6)

and use the minimization

μ̄(i) = arg min_{u∈U(i)} Q̃(i, u, r)    (6.7)

for policy improvement. Here, r is an adjustable parameter vector and Q̃(i, u, r) is a parametric architecture, possibly of the linear form

Q̃(i, u, r) = Σ_{k=1}^{s} r_k φ_k(i, u),

where φ_k(i, u) are basis functions that depend on both state and control [cf. Eq. (6.4)].
The important point here is that given the current policy μ, we can construct Q-factor approximations Q̃(i, u, r) using any method for constructing cost approximations J̃(i, r). The way to do this is to apply the latter method to the Markov chain whose states are the pairs (i, u), and the probability of transition from (i, u) to (j, v) is

p_ij(u) if v = μ(j),

and is 0 otherwise. This is the probabilistic mechanism by which state-control pairs evolve under the stationary policy μ.

A major concern with this approach is that the state-control pairs (i, u) with u ≠ μ(i) are never generated in this Markov chain, so they are not represented in the cost samples used to construct the approximation Q̃(i, u, r) (see Fig. 6.1.4). This creates an acute difficulty due to diminished exploration, which must be carefully addressed in any simulation-based implementation. We will return to the use of Q-factors in Section 6.5, where we will discuss exact and approximate implementations of the Q-learning algorithm.
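A sketch of the Q-factor-based improvement of Eqs. (6.6)-(6.7) with a linear architecture; phi(i, u) is a hypothetical state-control feature map, and no transition probabilities enter the computation.

def q_tilde(i, u, r, phi):
    # Linear Q-factor architecture: Q_tilde(i, u, r) = sum_k r_k * phi_k(i, u).
    return sum(r_k * phi_k for r_k, phi_k in zip(r, phi(i, u)))

def improved_control(i, U, r, phi):
    # Policy improvement of Eq. (6.7): minimize Q_tilde(i, u, r) over u in U(i).
    return min(U(i), key=lambda u: q_tilde(i, u, r, phi))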
Figure 6.1.4 Markov chain underlying Q-factor-based policy evaluation, associated with policy μ. The states are the pairs (i, u), and the probability of transition from (i, u) to (j, v) is p_ij(u) if v = μ(j), and is 0 otherwise. Thus, after the first transition, the generated pairs are exclusively of the form (i, μ(i)); pairs of the form (i, u), u ≠ μ(i), are not explored.
The Issue of Policy Oscillations
Contrary to exact policy iteration, which converges to an optimal policy
in a fairly regular manner, approximate policy iteration may oscillate. By
this we mean that after a few iterations, policies tend to repeat in cycles.
The associated parameter vectors r may also tend to oscillate. This phe-
nomenon is explained in Section 6.3.8 and can be particularly damaging,
because there is no guarantee that the policies involved in the oscillation are
good policies, and there is often no way to verify how well they perform
relative to the optimal.
We note that oscillations can be avoided and approximate policy it-
eration can be shown to converge under special conditions that arise in
particular when aggregation is used for policy evaluation. These condi-
tions involve certain monotonicity assumptions regarding the choice of the
matrix Φ, which are fulfilled in the case of aggregation (see Section 6.3.8, and also Section 6.4.2). However, when Φ is chosen in an unrestricted manner, as often happens in practical applications of the projected equation
methods of Section 6.3, policy oscillations tend to occur generically, and
often for very simple problems (see Section 6.3.8 for an example).
6.1.3 Direct and Indirect Approximation
We will now preview two general algorithmic approaches for approximating the cost function of a fixed stationary policy μ within a subspace of the form S = { Φr | r ∈ ℜ^s }. (A third approach, based on aggregation, uses a special type of matrix Φ and is discussed in Section 6.4.) The first and most straightforward approach, referred to as direct, is to find an approximation J̃ ∈ S that matches best J_μ in some normed error sense, i.e.,

min_{J̃∈S} ‖ J_μ − J̃ ‖,

or equivalently,

min_{r∈ℜ^s} ‖ J_μ − Φr ‖

(see the left-hand side of Fig. 6.1.5). Here, ‖·‖ is usually some (possibly weighted) Euclidean norm, in which case the approximation problem is a linear least squares problem, whose solution, denoted r*, can in principle be obtained in closed form by solving the associated quadratic minimization problem. If the matrix Φ has linearly independent columns, the solution is unique and can also be represented as

Φr* = ΠJ_μ,

where Π denotes projection with respect to ‖·‖ on the subspace S. A major difficulty is that specific cost function values J_μ(i) can only be estimated through their simulation-generated cost samples, as we discuss in Section 6.2.

Note that direct approximation may be used in other approximate DP contexts, such as finite horizon problems, where we use sequential single-stage approximation of the cost-to-go functions J_k, going backwards (i.e., starting with J_N, we obtain a least squares approximation of J_{N−1}, which is used in turn to obtain a least squares approximation of J_{N−2}, etc). This approach is sometimes called fitted value iteration.

In what follows in this chapter, we will not distinguish between the linear operation of projection and the corresponding matrix representation, denoting them both by Π. The meaning should be clear from the context.

Figure 6.1.5 Two methods for approximating the cost function J_μ as a linear combination of basis functions (subspace S). In the direct method (figure on the left), J_μ is projected on S. In the indirect method (figure on the right), the approximation is found by solving Φr = ΠT_μ(Φr), a projected form of Bellman's equation.
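If J_μ and the projection weights were available explicitly (which is precisely what simulation is meant to circumvent), the direct approximation would be an ordinary weighted linear least squares problem; a minimal sketch, with xi denoting the weight vector of the norm:

import numpy as np

def direct_fit(Phi, J_mu, xi):
    # Minimize sum_i xi_i * (J_mu[i] - (Phi r)[i])^2.  The minimizer satisfies
    # Phi' Xi Phi r = Phi' Xi J_mu, so that Phi r* is the projection of J_mu on S.
    Xi = np.diag(xi)
    A = Phi.T @ Xi @ Phi
    b = Phi.T @ Xi @ J_mu
    return np.linalg.solve(A, b)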
An alternative and more popular approach, referred to as indirect, is to approximate the solution of Bellman's equation J = T_μ J on the subspace S (see the right-hand side of Fig. 6.1.5). An important example of this approach, which we will discuss in detail in Section 6.3, leads to the problem of finding a vector r* such that

Φr* = ΠT_μ(Φr*).    (6.8)

We can view this equation as a projected form of Bellman's equation. We will consider another type of indirect approach based on aggregation in Section 6.4.
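Anticipating Section 6.3, note that for a linear architecture and a known model the projected equation (6.8) is a low-dimensional linear system: if Ξ = diag(ξ) is the weighting of the projection norm, then Φr = ΠT_μ(Φr) is equivalent to the orthogonality condition Φ'Ξ(g_μ + αP_μΦr − Φr) = 0. The sketch below solves this system directly; in the simulation-based methods of Section 6.3 the same matrices are estimated by sampling rather than formed explicitly.

import numpy as np

def solve_projected_bellman(Phi, P, g, alpha, xi):
    # Solve C r = d, where C = Phi' Xi (I - alpha P) Phi and d = Phi' Xi g;
    # the solution r* satisfies Phi r* = Pi T_mu(Phi r*).
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(g)) - alpha * P) @ Phi
    d = Phi.T @ Xi @ g
    return np.linalg.solve(C, d)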
We note that solving projected equations as approximations to more
complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods (see e.g., [Kra72]). For example, some of the most popular finite-element methods for partial differential equations are of this type. However, the use of the Monte Carlo simulation ideas that are central in approximate DP is an important characteristic that differentiates the methods of the present chapter from the Galerkin
methodology.
An important fact here is that ΠT_μ is a contraction, provided we use a special weighted Euclidean norm for projection, as will be proved in Section 6.3 for discounted problems (Prop. 6.3.1). In this case, Eq. (6.8) has a unique solution, and allows the use of algorithms such as LSPE(λ) and TD(λ), which are discussed in Section 6.3. Unfortunately, the contraction property of ΠT_μ does not extend to the case where T_μ is replaced by T, the DP mapping corresponding to multiple/all policies, although there
are some interesting exceptions, one of which relates to optimal stopping
problems and is discussed in Section 6.5.3.
6.1.4 Simplifications
We now consider various situations where the special structure of the prob-
lem may be exploited to simplify policy iteration or other approximate DP
algorithms.
Problems with Uncontrollable State Components
In many problems of interest the state is a composite (i, y) of two compo-
nents i and y, and the evolution of the main component i can be directly
affected by the control u, but the evolution of the other component y can-
not. Then as discussed in Section 1.4 of Vol. I, the value and the policy
iteration algorithms can be carried out over a smaller state space, the space
of the controllable component i. In particular, we assume that given the
state (i, y) and the control u, the next state (j, z) is determined as follows: j is generated according to transition probabilities p_ij(u, y), and z is generated according to conditional probabilities p(z | j) that depend on the main component j of the new state (see Fig. 6.1.6). Let us assume for notational convenience that the cost of a transition from state (i, y) is of the form g(i, y, u, j) and does not depend on the uncontrollable component z of the next state (j, z). If g depends on z, it can be replaced by

g(i, y, u, j) = Σ_z p(z | j) g(i, y, u, j, z)

in what follows.
Figure 6.1.6 States and transition probabilities for a problem with uncontrollable state components.
For an α-discounted problem, consider the mapping T̂ defined by

(T̂ Ĵ)(i) = Σ_y p(y | i) (T Ĵ)(i, y)
          = Σ_y p(y | i) min_{u∈U(i,y)} Σ_{j=0}^{n} p_ij(u, y) ( g(i, y, u, j) + α Ĵ(j) ),

and the corresponding mapping for a stationary policy μ,

(T̂_μ Ĵ)(i) = Σ_y p(y | i) (T_μ Ĵ)(i, y)
            = Σ_y p(y | i) Σ_{j=0}^{n} p_ij( μ(i, y), y ) ( g( i, y, μ(i, y), j ) + α Ĵ(j) ).

Bellman's equation, defined over the controllable state component i, takes the form

Ĵ(i) = (T̂ Ĵ)(i),    for all i.    (6.9)
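A sketch of one value iteration step with the mapping T̂, assuming the model quantities p(y | i), p_ij(u, y), and g(i, y, u, j) are available as Python callables; all names are illustrative.

def T_hat(J, states, Y, U, p_y, p_trans, g, alpha):
    # (T_hat J)(i) = sum_y p(y|i) min_u sum_j p_ij(u,y) (g(i,y,u,j) + alpha J[j]).
    new_J = {}
    for i in states:
        total = 0.0
        for y in Y:
            best = min(sum(p_trans(i, j, u, y) * (g(i, y, u, j) + alpha * J[j])
                           for j in states)
                       for u in U(i, y))
            total += p_y(y, i) * best
        new_J[i] = total
    return new_J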
The typical iteration of the simplified policy iteration algorithm consists of two steps:

(a) The policy evaluation step, which given the current policy μ^k(i, y), computes the unique Ĵ_{μ^k}(i), i = 1, . . . , n, that solve the linear system of equations Ĵ_{μ^k} = T̂_{μ^k} Ĵ_{μ^k}, or equivalently

Ĵ_{μ^k}(i) = Σ_y p(y | i) Σ_{j=0}^{n} p_ij( μ^k(i, y) ) ( g( i, y, μ^k(i, y), j ) + α Ĵ_{μ^k}(j) ),    for all i = 1, . . . , n.

(b) The policy improvement step, which computes the improved policy μ^{k+1}(i, y), from the equation T̂_{μ^{k+1}} Ĵ_{μ^k} = T̂ Ĵ_{μ^k}, or equivalently

μ^{k+1}(i, y) = arg min_{u∈U(i,y)} Σ_{j=0}^{n} p_ij(u, y) ( g(i, y, u, j) + α Ĵ_{μ^k}(j) ),    for all (i, y).
Approximate policy iteration algorithms can be similarly carried out in
reduced form.
Problems with Post-Decision States
In some stochastic problems, the transition probabilities and stage costs have the special form

p_ij(u) = q( j | f(i, u) ),    (6.10)

where f is some function and q( · | f(i, u) ) is a given probability distribution for each value of f(i, u). In words, the dependence of the transitions on (i, u) comes through the function f(i, u). We may exploit this structure by viewing f(i, u) as a form of state: a post-decision state that determines the probabilistic evolution to the next state. An example where the conditions (6.10) are satisfied are inventory control problems of the type considered in Section 4.2 of Vol. I. There the post-decision state at time k is x_k + u_k, i.e., the post-purchase inventory, before any demand at time k has been filled.
Post-decision states can be exploited when the stage cost has no dependence on j, i.e., when we have (with some notation abuse)

g(i, u, j) = g(i, u).

If there is dependence on j, one may consider computing, possibly by simulation, (an approximation to) g(i, u) = Σ_{j=1}^{n} p_ij(u) g(i, u, j), and using it in place of g(i, u, j).

Then the optimal cost-to-go within an α-discounted context at state i is given by

J*(i) = min_{u∈U(i)} ( g(i, u) + α V*( f(i, u) ) ),

while the optimal cost-to-go at post-decision state m (optimal sum of costs of future stages) is given by

V*(m) = Σ_{j=1}^{n} q(j | m) J*(j).

In effect, we consider a modified problem where the state space is enlarged to include post-decision states, with transitions between ordinary states and post-decision states specified by f and q( · | f(i, u) ) (see Fig. 6.1.7). The preceding two equations represent Bellman's equation for this modified problem.

Combining these equations, we have

V*(m) = Σ_{j=1}^{n} q(j | m) min_{u∈U(j)} ( g(j, u) + α V*( f(j, u) ) ),    for all m,    (6.11)

which can be viewed as Bellman's equation over the space of post-decision states m. This equation is similar to Q-factor equations, but is defined over the space of post-decision states rather than the larger space of state-control pairs.

Figure 6.1.7 Modified problem where the post-decision states are viewed as additional states.

The advantage of this equation is that once the function V* is calculated (or approximated), the optimal policy can be computed as

μ*(i) = arg min_{u∈U(i)} ( g(i, u) + α V*( f(i, u) ) ),

which does not require the knowledge of transition probabilities and computation of an expected value. It involves a deterministic optimization, and it can be used in a model-free context (as long as the functions g and f are known). This is important if the calculation of the optimal policy is done on-line.
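Extracting a control from an (approximate) post-decision cost function is then a purely deterministic optimization, as the following one-line sketch illustrates; g, f, and V are assumed available as ordinary functions.

def post_decision_greedy(i, U, g, f, V, alpha):
    # argmin over u in U(i) of g(i, u) + alpha * V(f(i, u)); no expected value
    # and no transition probabilities are involved.
    return min(U(i), key=lambda u: g(i, u) + alpha * V(f(i, u)))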
It is straightforward to construct a policy iteration algorithm that is defined over the space of post-decision states. The cost-to-go function V_μ of a stationary policy μ is the unique solution of the corresponding Bellman equation

V_μ(m) = Σ_{j=1}^{n} q(j | m) ( g( j, μ(j) ) + α V_μ( f( j, μ(j) ) ) ),    for all m.

Given V_μ, the improved policy is obtained as

μ̄(i) = arg min_{u∈U(i)} ( g(i, u) + α V_μ( f(i, u) ) ),    i = 1, . . . , n.

There are also corresponding approximate policy iteration methods with cost function approximation.

An advantage of this method when implemented by simulation is that the computation of the improved policy does not require the calculation of expected values. Moreover, with a simulator, the policy evaluation of V_μ can be done in model-free fashion, without explicit knowledge of the probabilities q(j | m). These advantages are shared with policy iteration algorithms based on Q-factors. However, when function approximation is used in policy iteration, the methods using post-decision states may have a significant advantage over Q-factor-based methods: they use cost function approximation in the space of post-decision states, rather than the larger space of state-control pairs, and they are less susceptible to difficulties due to inadequate exploration.
We note that there is a similar simplification with post-decision states when g is of the form

g(i, u, j) = h( f(i, u), j ),

for some function h. Then we have

J*(i) = min_{u∈U(i)} V*( f(i, u) ),

where V* is the unique solution of the equation

V*(m) = Σ_{j=1}^{n} q(j | m) ( h(m, j) + α min_{u∈U(j)} V*( f(j, u) ) ),    for all m.

Here V*(m) should be interpreted as the optimal cost-to-go from post-decision state m, including the cost h(m, j) incurred within the stage when m was generated. When h does not depend on j, the algorithm takes the simpler form

V*(m) = h(m) + α Σ_{j=1}^{n} q(j | m) min_{u∈U(j)} V*( f(j, u) ),    for all m.    (6.12)
Example 6.1.4 (Tetris)
Let us revisit the game of tetris, which was discussed in Example 1.4.1 of Vol.
I in the context of problems with an uncontrollable state component. We
will show that it also admits a post-decision state. Assuming that the game
terminates with probability 1 for every policy (a proof of this has been given
by Burgiel [Bur97]), we can model the problem of finding an optimal tetris
playing strategy as a stochastic shortest path problem.
The state consists of two components:
(1) The board position, i.e., a binary description of the full/empty status
of each square, denoted by x.
(2) The shape of the current falling block, denoted by y (this is the uncon-
trollable component).
The control, denoted by u, is the horizontal positioning and rotation applied
to the falling block.
Bellman's equation over the space of the controllable state component takes the form

Ĵ(x) = Σ_y p(y) max_u ( g(x, y, u) + Ĵ( f(x, y, u) ) ),    for all x,

where g(x, y, u) and f(x, y, u) are the number of points scored (rows removed), and the board position when the state is (x, y) and control u is applied, respectively [cf. Eq. (6.9)].

This problem also admits a post-decision state. Once u is applied at state (x, y), a new board position m is obtained, and the new state component x is obtained from m after removing a number of rows. Thus we have

m = f(x, y, u)

for some function f, and m also determines the reward of the stage, which has the form h(m) for some function h [h(m) is the number of complete rows that can be removed from m]. Thus, m may serve as a post-decision state, and the corresponding Bellman's equation takes the form (6.12), i.e.,

V*(m) = h(m) + Σ_{(x,y)} q(m, x, y) max_u V*( f(x, y, u) ),    for all m,

where (x, y) is the state that follows m, and q(m, x, y) are the corresponding transition probabilities. Note that both of the simplified Bellman equations share the same characteristic: they involve a deterministic optimization.
Trading Off Complexity of Control Space with Complexity of State Space
Suboptimal control using cost function approximation deals fairly well with large state spaces, but still encounters serious difficulties when the number of controls available at each state is large. In particular, the minimization

min_{u∈U(i)} Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α J̃(j, r) )

using an approximate cost-to-go function J̃(j, r) may be very time-consuming. For multistep lookahead schemes, the difficulty is exacerbated, since the required computation grows exponentially with the size of the lookahead horizon. It is thus useful to know that by reformulating the problem, it may be possible to reduce the complexity of the control space by increasing the complexity of the state space. The potential advantage is that the extra state space complexity may still be dealt with by using function approximation and/or rollout.
In particular, suppose that the control u consists of m components,

u = (u_1, . . . , u_m).

Then, at a given state i, we can break down u into the sequence of the m controls u_1, u_2, . . . , u_m, and introduce artificial intermediate states (i, u_1), (i, u_1, u_2), . . . , (i, u_1, . . . , u_{m−1}), and corresponding transitions to model the effect of these controls. The choice of the last control component u_m at state (i, u_1, . . . , u_{m−1}) marks the transition to state j according to the given transition probabilities p_ij(u). In this way the control space is simplified at the expense of introducing m − 1 additional layers of states, and m − 1 additional cost-to-go functions

J_1(i, u_1), J_2(i, u_1, u_2), . . . , J_{m−1}(i, u_1, . . . , u_{m−1}).
To deal with the increase in size of the state space we may use rollout, i.e., when at state (i, u_1, . . . , u_k), assume that future controls u_{k+1}, . . . , u_m will be chosen by a base heuristic. Alternatively, we may use function approximation, that is, introduce cost-to-go approximations

J̃_1(i, u_1, r_1), J̃_2(i, u_1, u_2, r_2), . . . , J̃_{m−1}(i, u_1, . . . , u_{m−1}, r_{m−1}),

in addition to J̃(i, r). We refer to [BeT96], Section 6.1.4, for further discussion.

A potential complication in the preceding schemes arises when the controls u_1, . . . , u_m are coupled through a constraint of the form

u = (u_1, . . . , u_m) ∈ U(i).    (6.13)

Then, when choosing a control u_k, care must be exercised to ensure that the future controls u_{k+1}, . . . , u_m can be chosen together with the already chosen controls u_1, . . . , u_k to satisfy the feasibility constraint (6.13). This requires a variant of the rollout algorithm that works with constrained DP problems; see Exercise 6.19 of Vol. I, and also references [Ber05a], [Ber05b].
6.1.5 Monte Carlo Simulation
In this subsection and the next, we will try to provide some orientation
into the mathematical content of this chapter. The reader may wish to
skip these subsections at rst, but return to them later for a higher level
view of some of the subsequent technical material.
The methods of this chapter rely to a large extent on simulation in
conjunction with cost function approximation in order to deal with large
state spaces. The advantage that simulation holds in this regard can be
traced to its ability to compute (approximately) sums with a very large
number of terms. These sums arise in a number of contexts: inner product
and matrix-vector product calculations, the solution of linear systems of
equations and policy evaluation, linear least squares problems, etc.
Example 6.1.5 (Approximate Policy Evaluation)
Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem:

J = g + αPJ,

where P is the transition probability matrix and α is the discount factor. Let us adopt a hard aggregation approach (cf. Section 6.3.4 of Vol. I; see also Section 6.4 later in this chapter), whereby we divide the n states in two disjoint subsets I_1 and I_2 with I_1 ∪ I_2 = {1, . . . , n}, and we use the piecewise constant approximation

J(i) = r_1 if i ∈ I_1,    J(i) = r_2 if i ∈ I_2.

This corresponds to the linear feature-based architecture J ≈ Φr, where Φ is the n × 2 matrix with column components equal to 1 or 0, depending on whether the component corresponds to I_1 or I_2.
We obtain the approximate equations

J(i) ≈ g(i) + α ( Σ_{j∈I_1} p_ij ) r_1 + α ( Σ_{j∈I_2} p_ij ) r_2,    i = 1, . . . , n,

which we can reduce to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in I_1 and I_2, respectively:

r_1 ≈ (1/n_1) Σ_{i∈I_1} J(i),    r_2 ≈ (1/n_2) Σ_{i∈I_2} J(i),

where n_1 and n_2 are the numbers of states in I_1 and I_2, respectively. We thus obtain the aggregate system of the following two equations in r_1 and r_2:

r_1 = (1/n_1) Σ_{i∈I_1} g(i) + (α/n_1) ( Σ_{i∈I_1} Σ_{j∈I_1} p_ij ) r_1 + (α/n_1) ( Σ_{i∈I_1} Σ_{j∈I_2} p_ij ) r_2,

r_2 = (1/n_2) Σ_{i∈I_2} g(i) + (α/n_2) ( Σ_{i∈I_2} Σ_{j∈I_1} p_ij ) r_1 + (α/n_2) ( Σ_{i∈I_2} Σ_{j∈I_2} p_ij ) r_2.

Here the challenge, when the number of states n is very large, is the calculation of the large sums in the right-hand side, which can be of order O(n²).
Simulation allows the approximate calculation of these sums with complexity
that is independent of n. This is similar to the advantage that Monte-Carlo
integration holds over numerical integration, as discussed in standard texts
on Monte-Carlo methods.
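As an illustration of how simulation can estimate these sums, one may sample states i uniformly from I_1 (or I_2), draw a single transition j from a simulator of the chain, and average; the resulting estimates are unbiased for the corresponding normalized sums. The sketch below is illustrative only and is not the specific methodology developed later in the chapter.

import random

def estimate_aggregate_row(I_k, I_1, I_2, sample_next, g, num_samples=10000):
    # Estimates (1/n_k) sum_{i in I_k} g(i) and the normalized transition sums
    # (1/n_k) sum_{i in I_k} sum_{j in I_1} p_ij (and the same with I_2),
    # using only a simulator sample_next(i) that draws j according to row i of P.
    g_sum = into_1 = into_2 = 0.0
    for _ in range(num_samples):
        i = random.choice(I_k)          # uniform sample over I_k
        j = sample_next(i)              # one simulated transition from i
        g_sum += g(i)
        into_1 += (j in I_1)
        into_2 += (j in I_2)
    n = float(num_samples)
    return g_sum / n, into_1 / n, into_2 / n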
To see how simulation can be used with advantage, let us consider the problem of estimating a scalar sum of the form

z = Σ_{ω∈Ω} v(ω),

where Ω is a finite set and v : Ω ↦ ℜ is a function of ω. We introduce a distribution ξ that assigns positive probability ξ(ω) to every element ω ∈ Ω (but is otherwise arbitrary), and we generate a sequence {ω_1, . . . , ω_T} of samples from Ω, with each sample ω_t taking values from Ω according to ξ. We then estimate z with

ẑ_T = (1/T) Σ_{t=1}^{T} v(ω_t)/ξ(ω_t).    (6.14)
Clearly z is unbiased:
E[ z
T
] =
1
T
T

t=1
E
_
v(
t
)
(
t
)
_
=
1
T
T

t=1

()
v()
()
=

v() = z.
Suppose now that the samples are generated in a way that the long-
term frequency of each is equal to (), i.e.,
lim
T
T

t=1
(
t
= )
T
= (), , (6.15)
where () denotes the indicator function [(E) = 1 if the event E has
occurred and (E) = 0 otherwise]. Then from Eq. (6.14), we have
z
T
=

t=1
(
t
= )
T

v()
()
,
and by taking limit as T and using Eq. (6.15),
lim
T
z
T
=

lim
T
T

t=1
(
t
= )
T

v()
()
=

v() = z.
Thus in the limit, as the number of samples increases, we obtain the desired
sum z. An important case, of particular relevance to the methods of this
chapter, is when is the set of states of an irreducible Markov chain. Then,
if we generate an innitely long trajectory
1
,
2
, . . . starting from any
348 Approximate Dynamic Programming Chap. 6
initial state
1
, then the condition (6.15) will hold with probability 1, with
() being the steady-state probability of state .
The samples
t
need not be independent for the preceding properties
to hold, but if they are, then the variance of z
T
is the sum of the variances
of the independent components in the sum of Eq. (6.14), and is given by
var( z
T
) =
1
T
2
T

t=1

()
_
v()
()
z
_
2
=
1
T

()
_
v()
()
z
_
2
.
(6.16)
An important observation from this formula is that the accuracy of the approximation does not depend on the number of terms in the sum $z$ (the number of elements in $\Omega$), but rather depends on the variance of the random variable that takes values $v(\omega)/\xi(\omega)$, $\omega \in \Omega$, with probabilities $\xi(\omega)$. Thus, it is possible to execute approximately linear algebra operations of very large size through Monte Carlo sampling (with whatever distributions may be convenient in a given context), and this is a principal idea underlying the methods of this chapter.

In the case where the samples are dependent, the variance formula (6.16) does not hold, but similar qualitative conclusions can be drawn under various assumptions, which ensure that the dependencies between samples become sufficiently weak over time (see the specialized literature).
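As a concrete illustration of the estimate (6.14) and of the role of the sampling distribution $\xi$ in the variance formula (6.16), consider the following minimal Python sketch; the names and the test data are ours, not part of the text.

import numpy as np

def estimate_sum(v, xi, T, rng=None):
    # v  : array of length |Omega| with the terms v(omega)
    # xi : sampling distribution over Omega (positive, sums to 1)
    # T  : number of independent samples
    # Returns the estimate of z = sum_omega v(omega), cf. Eq. (6.14).
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(v), size=T, p=xi)    # samples omega_1, ..., omega_T
    return np.mean(v[idx] / xi[idx])

# Small experiment: the same sum estimated with two choices of xi.
rng = np.random.default_rng(0)
v = rng.random(100_000)
z = v.sum()
xi_uniform = np.full(len(v), 1.0 / len(v))
xi_shaped = v / v.sum()     # proportional to v: zero-variance estimator, cf. Eq. (6.16)
print(z, estimate_sum(v, xi_uniform, 10_000, rng), estimate_sum(v, xi_shaped, 10_000, rng))

With $\xi$ proportional to $v$ the estimate is exact, in agreement with the importance sampling discussion below; in practice $\xi$ is only an approximation to this ideal choice.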
Monte Carlo simulation is also important in the context of this chapter for an additional reason. In addition to its ability to compute efficiently sums of very large numbers of terms, it can often do so in model-free fashion (i.e., by using a simulator, rather than an explicit model of the terms in the sum).

The selection of the distribution $\{\xi(\omega) \mid \omega \in \Omega\}$ can be optimized (at least approximately), and methods for doing this are the subject of the technique of importance sampling. In particular, assuming that samples are independent and that $v(\omega) \ge 0$ for all $\omega \in \Omega$, we have from Eq. (6.16) that the optimal distribution is $\xi^* = v/z$ and the corresponding minimum variance value is 0. However, $\xi^*$ cannot be computed without knowledge of $z$. Instead, $\xi$ is usually chosen to be an approximation to $v$, normalized so that its components add to 1. Note that we may assume that $v(\omega) \ge 0$ for all $\omega \in \Omega$ without loss of generality: when $v$ takes negative values, we may decompose $v$ as
$$v = v^+ - v^-,$$
so that both $v^+$ and $v^-$ are positive functions, and then estimate separately $z^+ = \sum_{\omega\in\Omega} v^+(\omega)$ and $z^- = \sum_{\omega\in\Omega} v^-(\omega)$.

6.1.6 Contraction Mappings and Simulation

Most of the chapter (Sections 6.3-6.8) deals with the approximate computation of a fixed point of a (linear or nonlinear) mapping $T$ within a subspace
$$S = \{\Phi r \mid r \in \Re^s\}.$$
We will discuss a variety of approaches with distinct characteristics, but at
an abstract mathematical level, these approaches fall into two categories:
(a) A projected equation approach, based on the equation
$$\Phi r = \Pi T(\Phi r), \tag{6.17}$$
where $\Pi$ is a projection operation with respect to a Euclidean norm (see Section 6.3 for discounted problems, and Sections 7.1-7.3 for other types of problems).

(b) An aggregation approach, based on an equation of the form
$$\Phi r = \Phi D T(\Phi r), \tag{6.18}$$
where $D$ is an $s \times n$ matrix whose rows are probability distributions, and $D$ and $\Phi$ are matrices that satisfy certain restrictions.

When iterative methods are used for solution of Eqs. (6.17) and (6.18), it is important that $\Pi T$ and $\Phi D T$ be contractions over the subspace $S$. Note here that even if $T$ is a contraction mapping (as is ordinarily the case in DP), it does not follow that $\Pi T$ and $\Phi D T$ are contractions. In our analysis, this is resolved by requiring that $T$ be a contraction with respect to a norm such that $\Pi$ or $\Phi D$, respectively, is a nonexpansive mapping. As a result, we need various assumptions on $T$, $\Phi$, and $D$, which guide the algorithmic development. We postpone further discussion of these issues, but for the moment we note that the projection approach revolves mostly around Euclidean norm contractions and cases where $T$ is linear, while the aggregation/Q-learning approach revolves mostly around sup-norm contractions.

If $T$ is linear, both equations (6.17) and (6.18) may be written as square systems of linear equations of the form $Cr = d$, whose solution can be approximated by simulation. The approach here is very simple: we approximate $C$ and $d$ with simulation-generated approximations $\hat C$ and $\hat d$, and we solve the resulting (approximate) linear system $\hat C r = \hat d$ by matrix inversion, thereby obtaining the solution estimate $\hat r = \hat C^{-1}\hat d$. A primary example is the LSTD methods of Section 6.3.4. We may also try to solve the linear system $\hat C r = \hat d$ iteratively, which leads to the LSPE type of methods, some of which produce estimates of $r$ simultaneously with the generation of the simulation samples of $w$ (see Section 6.3.4).
Stochastic Approximation Methods
Let us also mention some stochastic iterative algorithms that are based on a somewhat different simulation idea, and fall within the framework of stochastic approximation methods. The TD($\lambda$) and Q-learning algorithms fall in this category. For an informal orientation, let us consider the computation of the fixed point of a general mapping $F : \Re^n \mapsto \Re^n$ that is a contraction mapping with respect to some norm, and involves an expected value: it has the form
$$F(x) = E\big[f(x, w)\big], \tag{6.19}$$
where $x \in \Re^n$ is a generic argument of $F$, $w$ is a random variable and $f(\cdot, w)$ is a given function. Assume for simplicity that $w$ takes values in a finite set $W$ with probabilities $p(w)$, so that the fixed point equation $x = F(x)$ has the form
$$x = \sum_{w\in W} p(w) f(x, w).$$
We generate a sequence of samples $w_1, w_2, \ldots$ such that the empirical frequency of each value $w \in W$ is equal to its probability $p(w)$, i.e.,
$$\lim_{k\to\infty}\frac{n_k(w)}{k} = p(w), \qquad w \in W,$$
where $n_k(w)$ denotes the number of times that $w$ appears in the first $k$ samples $w_1, \ldots, w_k$. This is a reasonable assumption that may be verified by application of various laws of large numbers to the sampling method at hand.

Given the samples, we may consider approximating the fixed point of $F$ by the (approximate) fixed point iteration
$$x_{k+1} = \sum_{w\in W}\frac{n_k(w)}{k} f(x_k, w), \tag{6.20}$$
which can also be equivalently written as
$$x_{k+1} = \frac{1}{k}\sum_{i=1}^k f(x_k, w_i). \tag{6.21}$$
We may view Eq. (6.20) as a simulation-based version of the convergent fixed point iteration
$$x_{k+1} = F(x_k) = \sum_{w\in W} p(w) f(x_k, w),$$
where the probabilities $p(w)$ have been replaced by the empirical frequencies $n_k(w)/k$. Thus we expect that the simulation-based iteration (6.21) converges to the fixed point of $F$.

On the other hand the iteration (6.21) has a major flaw: it requires, for each $k$, the computation of $f(x_k, w_i)$ for all sample values $w_i$, $i = 1, \ldots, k$. An algorithm that requires much less computation than iteration (6.21) is
$$x_{k+1} = \frac{1}{k}\sum_{i=1}^k f(x_i, w_i), \qquad k = 1, 2, \ldots, \tag{6.22}$$
where only one value of $f$ per sample $w_i$ is computed. This iteration can also be written in the simple recursive form
$$x_{k+1} = (1 - \gamma_k)x_k + \gamma_k f(x_k, w_k), \qquad k = 1, 2, \ldots, \tag{6.23}$$
with the stepsize $\gamma_k$ having the form $\gamma_k = 1/k$. As an indication of its validity, we note that if it converges to some limit then this limit must be the fixed point of $F$, since for large $k$ the iteration (6.22) becomes essentially identical to the iteration $x_{k+1} = F(x_k)$. Other stepsize rules, which satisfy $\gamma_k \to 0$ and $\sum_{k=1}^\infty \gamma_k = \infty$, may also be used. However, a rigorous analysis of the convergence of iteration (6.23) is nontrivial and is beyond our scope. The book by Bertsekas and Tsitsiklis [BeT96] contains a fairly detailed development, which is tailored to DP. Other more general references are Benveniste, Metivier, and Priouret [BMP90], Borkar [Bor08], Kushner and Yin [KuY03], and Meyn [Mey07].
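For concreteness, here is a minimal Python sketch of the recursion (6.23) on a simple scalar example; the example mapping and all names are ours, and the mapping is a contraction only for the data chosen here.

import numpy as np

def stochastic_approximation(f, sample_w, x0, num_iters, rng):
    # Implements x_{k+1} = (1 - gamma_k) x_k + gamma_k f(x_k, w_k),
    # with gamma_k = 1/k, cf. Eq. (6.23).
    x = float(x0)
    for k in range(1, num_iters + 1):
        gamma = 1.0 / k
        w = sample_w(rng)
        x = (1.0 - gamma) * x + gamma * f(x, w)
    return x

# Example: f(x, w) = 0.5*x + w with E[w] = 1, so F(x) = 0.5*x + 1
# is a contraction with fixed point x* = 2.
rng = np.random.default_rng(0)
x_hat = stochastic_approximation(lambda x, w: 0.5 * x + w,
                                 lambda rng: rng.normal(1.0, 1.0),
                                 x0=0.0, num_iters=100_000, rng=rng)
print(x_hat)    # close to 2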
6.2 DIRECT POLICY EVALUATION - GRADIENT METHODS
We will now consider the direct approach for policy evaluation. In particular, suppose that the current policy is $\mu$, and for a given $r$, $\tilde J(i, r)$ is an approximation of $J_\mu(i)$. We generate an "improved" policy $\overline\mu$ using the formula
$$\overline\mu(i) = \arg\min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\big(g(i, u, j) + \alpha\tilde J(j, r)\big), \qquad \text{for all } i. \tag{6.24}$$
To evaluate approximately $J_{\overline\mu}$, we select a subset of "representative" states $\tilde S$ (perhaps obtained by some form of simulation), and for each $i \in \tilde S$, we obtain $M(i)$ samples of the cost $J_{\overline\mu}(i)$. The $m$th such sample is denoted by $c(i, m)$, and mathematically, it can be viewed as being $J_{\overline\mu}(i)$ plus some simulation error/noise. Then we obtain the corresponding parameter vector $\overline r$ by solving the following least squares problem
$$\min_r\ \sum_{i\in\tilde S}\sum_{m=1}^{M(i)}\big(\tilde J(i, r) - c(i, m)\big)^2, \tag{6.25}$$
and we repeat the process with $\overline\mu$ and $\overline r$ replacing $\mu$ and $r$, respectively (see Fig. 6.1.1).

Direct policy evaluation methods have been historically important, and provide an interesting contrast with indirect methods. However, they are currently less popular than the projected equation methods to be considered in the next section, despite some generic advantages (the option to use nonlinear approximation architectures, and the capability of more accurate approximation). The material of this section will not be substantially used later, so the reader may read lightly this section without loss of continuity.
The least squares problem (6.25) can be solved exactly if a linear approximation architecture is used, i.e., if
$$\tilde J(i, r) = \phi(i)'r,$$
where $\phi(i)'$ is a row vector of features corresponding to state $i$. In this case $r$ is obtained by solving the linear system of equations
$$\sum_{i\in\tilde S}\sum_{m=1}^{M(i)}\phi(i)\big(\phi(i)'r - c(i, m)\big) = 0,$$
which is obtained by setting to 0 the gradient with respect to $r$ of the quadratic cost in the minimization (6.25). When a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem (6.25), as we will now discuss.
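As a sketch of this linear case (with hypothetical data structures, not from the text), the parameter vector can be obtained by accumulating the normal equations of problem (6.25) and solving a small $s \times s$ system:

import numpy as np

def fit_linear_cost(phi, samples):
    # phi(i)  -> feature vector of state i (length s)
    # samples -> list of (i, c) pairs, where c = c(i, m) is a cost sample of state i
    # Solves sum_i sum_m phi(i) (phi(i)' r - c(i, m)) = 0 for r,
    # i.e., the normal equations of the least squares problem (6.25).
    s = len(phi(samples[0][0]))
    A = np.zeros((s, s))
    b = np.zeros(s)
    for i, c in samples:
        f = phi(i)
        A += np.outer(f, f)
        b += c * f
    return np.linalg.solve(A, b)   # assumes A is invertible (rich enough feature samples)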
Batch Gradient Methods for Policy Evaluation
Let us focus on an $N$-transition portion $(i_0, \ldots, i_N)$ of a simulated trajectory, also called a batch. We view the numbers
$$\sum_{t=k}^{N-1}\alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big), \qquad k = 0, \ldots, N-1,$$
as cost samples, one per initial state $i_0, \ldots, i_{N-1}$, which can be used for least squares approximation of the parametric architecture $\tilde J(i, r)$ [cf. Eq. (6.25)]:
$$\min_r\ \sum_{k=0}^{N-1}\frac{1}{2}\left(\tilde J(i_k, r) - \sum_{t=k}^{N-1}\alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big)\right)^2. \tag{6.26}$$

The manner in which the samples $c(i, m)$ are collected is immaterial for the purposes of the subsequent discussion. Thus one may generate these samples through a single very long trajectory of the Markov chain corresponding to the policy being evaluated, or one may use multiple trajectories, with different starting points, to ensure that enough cost samples are generated for a "representative" subset of states. In either case, the samples $c(i, m)$ corresponding to any one state $i$ will generally be correlated as well as "noisy." Still the average $\frac{1}{M(i)}\sum_{m=1}^{M(i)} c(i, m)$ will ordinarily converge to the cost of that state as $M(i) \to \infty$ by a law of large numbers argument [see Exercise 6.2 and the discussion in [BeT96], Sections 5.1, 5.2, regarding the behavior of the average when $M(i)$ is finite and random].
One way to solve this least squares problem is to use a gradient method, whereby the parameter $r$ associated with $\mu$ is updated at time $N$ by
$$r := r - \gamma\sum_{k=0}^{N-1}\nabla\tilde J(i_k, r)\left(\tilde J(i_k, r) - \sum_{t=k}^{N-1}\alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big)\right). \tag{6.27}$$
Here, $\nabla\tilde J$ denotes gradient with respect to $r$ and $\gamma$ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment). Each of the $N$ terms in the summation in the right-hand side above is the gradient of a corresponding term in the least squares summation of problem (6.26). Note that the update of $r$ is done after processing the entire batch, and that the gradients $\nabla\tilde J(i_k, r)$ are evaluated at the preexisting value of $r$, i.e., the one before the update.
In a traditional gradient method, the gradient iteration (6.27) is
repeated, until convergence to the solution of the least squares problem
(6.26), i.e., a single N-transition batch is used. However, there is an im-
portant tradeoff relating to the size N of the batch: in order to reduce
simulation error and generate multiple cost samples for a representatively
large subset of states, it is necessary to use a large N, yet to keep the work
per gradient iteration small it is necessary to use a small N.
To address the issue of size of N, an expanded view of the gradient
method is preferable in practice, whereby batches may be changed after one
or more iterations. Thus, in this more general method, the N-transition
batch used in a given gradient iteration comes from a potentially longer
simulated trajectory, or from one of many simulated trajectories. A se-
quence of gradient iterations is performed, with each iteration using cost
samples formed from batches collected in a variety of different ways and whose length N may vary. Batches may also overlap to a substantial degree. We leave the method for generating simulated trajectories and forming batches open for the moment, but we note that it influences strongly
the result of the corresponding least squares optimization (6.25), provid-
ing better approximations for the states that arise most frequently in the
batches used. This is related to the issue of ensuring that the state space is
adequately explored, with an adequately broad selection of states being
represented in the least squares optimization, cf. our earlier discussion on
the exploration issue.
The gradient method (6.27) is simple, widely known, and easily un-
derstood. There are extensive convergence analyses of this method and
its variations, for which we refer to the literature cited at the end of the
chapter. These analyses often involve considerable mathematical sophis-
tication, particularly when multiple batches are involved, because of the
stochastic nature of the simulation and the complex correlations between
the cost samples. However, qualitatively, the conclusions of these analyses
are consistent among themselves as well as with practical experience, and
indicate that:
(1) Under some reasonable technical assumptions, convergence to a lim-
iting value of r that is a local minimum of the associated optimization
problem is expected.
(2) For convergence, it is essential to gradually reduce the stepsize to 0,
the most popular choice being to use a stepsize proportional to 1/m,
while processing the mth batch. In practice, considerable trial and
error may be needed to settle on an effective stepsize choice method. Sometimes it is possible to improve performance by using a different
stepsize (or scaling factor) for each component of the gradient.
(3) The rate of convergence is often very slow, and depends among other
things on the initial choice of r, the number of states and the dynamics
of the associated Markov chain, the level of simulation error, and
the method for stepsize choice. In fact, the rate of convergence is
sometimes so slow, that practical convergence is infeasible, even if
theoretical convergence is guaranteed.
Incremental Gradient Methods for Policy Evaluation
We will now consider a variant of the gradient method called incremental .
This method can also be described through the use of N-transition batches,
but we will see that (contrary to the batch version discussed earlier) the
method is suitable for use with very long batches, including the possibility
of a single very long simulated trajectory, viewed as a single batch.
For a given $N$-transition batch $(i_0, \ldots, i_N)$, the batch gradient method
), the batch gradient method
processes the N transitions all at once, and updates r using Eq. (6.27). The
incremental method updates r a total of N times, once after each transi-
tion. Each time it adds to r the corresponding portion of the gradient in
the right-hand side of Eq. (6.27) that can be calculated using the newly
available simulation data. Thus, after each transition $(i_k, i_{k+1})$:
(1) We evaluate the gradient $\nabla\tilde J(i_k, r)$ at the current value of $r$.

(2) We sum all the terms in the right-hand side of Eq. (6.27) that involve the transition $(i_k, i_{k+1})$, and we update $r$ by making a correction along their sum:
$$r := r - \gamma\left(\nabla\tilde J(i_k, r)\tilde J(i_k, r) - \Big(\sum_{t=0}^{k}\alpha^{k-t}\nabla\tilde J(i_t, r)\Big) g\big(i_k, \mu(i_k), i_{k+1}\big)\right). \tag{6.28}$$
By adding the parenthesized incremental correction terms in the above
iteration, we see that after N transitions, all the terms of the batch iter-
ation (6.27) will have been accumulated, but there is a difference: in the incremental version, $r$ is changed during the processing of the batch, and the gradient $\nabla\tilde J(i_t, r)$ is evaluated at the most recent value of $r$ [after the transition $(i_t, i_{t+1})$]. By contrast, in the batch version these gradients are
evaluated at the value of r prevailing at the beginning of the batch. Note
that the gradient sum in the right-hand side of Eq. (6.28) can be conve-
niently updated following each transition, thereby resulting in an efficient
implementation.
It can now be seen that because r is updated at intermediate transi-
tions within a batch (rather than at the end of the batch), the location of
the end of the batch becomes less relevant. It is thus possible to have very
long batches, and indeed the algorithm can be operated with a single very
long simulated trajectory and a single batch. In this case, for each state
i, we will have one cost sample for every time when state i is encountered
in the simulation. Accordingly state i will be weighted in the least squares
optimization in proportion to the frequency of its occurrence within the
simulated trajectory.
Generally, within the least squares/policy evaluation context of this
section, the incremental versions of the gradient methods can be imple-
mented more flexibly and tend to converge faster than their batch counter-
parts, so they will be adopted as the default in our discussion. The book
by Bertsekas and Tsitsiklis [BeT96] contains an extensive analysis of the
theoretical convergence properties of incremental gradient methods (they
are fairly similar to those of batch methods), and provides some insight into
the reasons for their superior performance relative to the batch versions; see
also the author's nonlinear programming book [Ber99] (Section 1.5.2), the paper by Bertsekas and Tsitsiklis [BeT00], and the author's recent survey
[Ber10d]. Still, however, the rate of convergence can be very slow.
Implementation Using Temporal Differences - TD(1)
We now introduce an alternative, mathematically equivalent, implemen-
tation of the batch and incremental gradient iterations (6.27) and (6.28),
which is described with cleaner formulas. It uses the notion of temporal
difference (TD for short) given by
$$q_k = \tilde J(i_k, r) - \alpha\tilde J(i_{k+1}, r) - g\big(i_k, \mu(i_k), i_{k+1}\big), \qquad k = 0, \ldots, N-2, \tag{6.29}$$
$$q_{N-1} = \tilde J(i_{N-1}, r) - g\big(i_{N-1}, \mu(i_{N-1}), i_N\big). \tag{6.30}$$
In particular, by noting that the parenthesized term multiplying $\nabla\tilde J(i_k, r)$ in Eq. (6.27) is equal to
$$q_k + \alpha q_{k+1} + \cdots + \alpha^{N-1-k} q_{N-1},$$
we can verify by adding the equations below that iteration (6.27) can also be implemented as follows:

After the state transition $(i_0, i_1)$, set
$$r := r - \gamma q_0\nabla\tilde J(i_0, r).$$
After the state transition $(i_1, i_2)$, set
$$r := r - \gamma q_1\big(\alpha\nabla\tilde J(i_0, r) + \nabla\tilde J(i_1, r)\big).$$
Proceeding similarly, after the state transition $(i_{N-1}, i_N)$, set
$$r := r - \gamma q_{N-1}\big(\alpha^{N-1}\nabla\tilde J(i_0, r) + \alpha^{N-2}\nabla\tilde J(i_1, r) + \cdots + \nabla\tilde J(i_{N-1}, r)\big).$$
The batch version (6.27) is obtained if the gradients $\nabla\tilde J(i_k, r)$ are all evaluated at the value of $r$ that prevails at the beginning of the batch. The incremental version (6.28) is obtained if each gradient $\nabla\tilde J(i_k, r)$ is evaluated at the value of $r$ that prevails when the transition $(i_k, i_{k+1})$ is processed.
In particular, for the incremental version, we start with some vector $r_0$, and following the transition $(i_k, i_{k+1})$, $k = 0, \ldots, N-1$, we set
$$r_{k+1} = r_k - \gamma_k q_k\sum_{t=0}^{k}\alpha^{k-t}\nabla\tilde J(i_t, r_t), \tag{6.31}$$
where the stepsize $\gamma_k$ may vary from one transition to the next. In the important case of a linear approximation architecture of the form
$$\tilde J(i, r) = \phi(i)'r, \qquad i = 1, \ldots, n,$$
where $\phi(i) \in \Re^s$ are some fixed vectors, it takes the form
$$r_{k+1} = r_k - \gamma_k q_k\sum_{t=0}^{k}\alpha^{k-t}\phi(i_t). \tag{6.32}$$
This algorithm is known as TD(1), and we will see in Section 6.3.6 that it is a limiting version (as $\lambda \to 1$) of the TD($\lambda$) method discussed there.
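The following is a minimal Python sketch of the incremental iteration (6.31)-(6.32) for a linear architecture, applied to one simulated $N$-transition batch; the simulator interface (a list of states and the corresponding stage costs) is hypothetical, as is the default stepsize rule.

import numpy as np

def td1_linear(phi, trajectory, costs, alpha, r0, stepsize=lambda k: 0.1 / (1 + k)):
    # phi(i)      -> feature vector phi(i) of state i (length s)
    # trajectory  -> states (i_0, ..., i_N) of one simulated batch
    # costs[k]    -> stage cost of the k-th transition (i_k, i_{k+1})
    # Implements r_{k+1} = r_k - gamma_k * q_k * sum_{t<=k} alpha^{k-t} phi(i_t),
    # cf. Eq. (6.32), with the "eligibility" vector sum_{t<=k} alpha^{k-t} phi(i_t)
    # maintained recursively.
    r = np.array(r0, dtype=float)
    z = np.zeros_like(r)                       # eligibility vector
    N = len(trajectory) - 1
    for k in range(N):
        i, j = trajectory[k], trajectory[k + 1]
        z = alpha * z + phi(i)
        # temporal difference, cf. Eqs. (6.29)-(6.30); the last transition of the
        # batch omits the alpha * phi(i_N)' r term
        q = phi(i) @ r - costs[k] - (alpha * (phi(j) @ r) if k < N - 1 else 0.0)
        r = r - stepsize(k) * q * z
    return r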
6.3 PROJECTED EQUATION METHODS
In this section, we consider the indirect approach, whereby the policy evaluation is based on solving a projected form of Bellman's equation (cf. the right-hand side of Fig. 6.1.5). We will be dealing with a single stationary policy $\mu$, so we generally suppress in our notation the dependence on control of the transition probabilities and the cost per stage. We thus consider a stationary finite-state Markov chain, and we denote the states by $i = 1, \ldots, n$, the transition probabilities by $p_{ij}$, $i, j = 1, \ldots, n$, and the stage costs by $g(i, j)$. We want to evaluate the expected cost of $\mu$ corresponding to each initial state $i$, given by
$$J_\mu(i) = \lim_{N\to\infty} E\left[\sum_{k=0}^{N-1}\alpha^k g(i_k, i_{k+1})\ \Big|\ i_0 = i\right], \qquad i = 1, \ldots, n,$$
where $i_k$ denotes the state at time $k$, and $\alpha \in (0, 1)$ is the discount factor.
We approximate $J_\mu(i)$ with a linear architecture of the form
$$\tilde J(i, r) = \phi(i)'r, \qquad i = 1, \ldots, n, \tag{6.33}$$
where $r$ is a parameter vector and $\phi(i)$ is an $s$-dimensional feature vector associated with the state $i$. (Throughout this section, vectors are viewed as column vectors, and a prime denotes transposition.) As earlier, we also write the vector
$$\big(\tilde J(1, r), \ldots, \tilde J(n, r)\big)'$$
in the compact form $\Phi r$, where $\Phi$ is the $n \times s$ matrix that has as rows the feature vectors $\phi(i)'$, $i = 1, \ldots, n$. Thus, we want to approximate $J_\mu$ within
$$S = \{\Phi r \mid r \in \Re^s\},$$
the subspace spanned by $s$ basis functions, the columns of $\Phi$. Our assumptions in this section are the following (we will later discuss how our methodology may be modified in the absence of these assumptions).
Assumption 6.3.1: The Markov chain has steady-state probabilities $\xi_1, \ldots, \xi_n$, which are positive, i.e., for all $i = 1, \ldots, n$,
$$\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} P(i_k = j \mid i_0 = i) = \xi_j > 0, \qquad j = 1, \ldots, n.$$

Assumption 6.3.2: The matrix $\Phi$ has rank $s$.
Assumption 6.3.1 is equivalent to assuming that the Markov chain is
irreducible, i.e., has a single recurrent class and no transient states. As-
sumption 6.3.2 is equivalent to the basis functions (the columns of $\Phi$) being linearly independent, and is analytically convenient because it implies that each vector $J$ in the subspace $S$ is represented in the form $\Phi r$ with a unique vector $r$.
6.3.1 The Projected Bellman Equation
We will now introduce the projected form of Bellman's equation. We use a weighted Euclidean norm on $\Re^n$ of the form
$$\|J\|_v = \sqrt{\sum_{i=1}^n v_i\big(J(i)\big)^2},$$
where $v$ is a vector of positive weights $v_1, \ldots, v_n$. Let $\Pi$ denote the projection operation onto $S$ with respect to this norm. Thus for any $J \in \Re^n$, $\Pi J$ is the unique vector in $S$ that minimizes $\|J - \tilde J\|_v^2$ over all $\tilde J \in S$. It can also be written as
$$\Pi J = \Phi r_J,$$
where
$$r_J = \arg\min_{r\in\Re^s}\|J - \Phi r\|_v^2, \qquad J \in \Re^n. \tag{6.34}$$
This is because $\Phi$ has rank $s$ by Assumption 6.3.2, so a vector in $S$ is uniquely written in the form $\Phi r$.
Note that $\Pi$ and $r_J$ can be written explicitly in closed form. This can be done by setting to 0 the gradient of the quadratic function
$$\|J - \Phi r\|_v^2 = (J - \Phi r)'V(J - \Phi r),$$
where $V$ is the diagonal matrix with $v_i$, $i = 1, \ldots, n$, along the diagonal [cf. Eq. (6.34)]. We thus obtain the necessary and sufficient optimality condition
$$\Phi'V(J - \Phi r_J) = 0, \tag{6.35}$$
from which
$$r_J = (\Phi'V\Phi)^{-1}\Phi'VJ,$$
and using the formula $\Phi r_J = \Pi J$,
$$\Pi = \Phi(\Phi'V\Phi)^{-1}\Phi'V.$$
[The inverse $(\Phi'V\Phi)^{-1}$ exists because $\Phi$ is assumed to have rank $s$; cf. Assumption 6.3.2.] The optimality condition (6.35), through left multiplication with $r'$, can also be equivalently expressed as
$$(\Phi r)'V(J - \Phi r_J) = 0, \qquad \forall\ \Phi r \in S. \tag{6.36}$$
The interpretation is that the difference/approximation error $J - \Phi r_J$ is orthogonal to the subspace $S$ in the scaled geometry of the norm $\|\cdot\|_v$ (two vectors $x, y \in \Re^n$ are called orthogonal if $x'Vy = \sum_{i=1}^n v_i x_i y_i = 0$).

Consider now the mapping $T$ given by
$$(TJ)(i) = \sum_{j=1}^n p_{ij}\big(g(i, j) + \alpha J(j)\big), \qquad i = 1, \ldots, n,$$
the mapping $\Pi T$ (the composition of $\Pi$ with $T$), and the equation
$$\Phi r = \Pi T(\Phi r). \tag{6.37}$$
We view this as a projected/approximate form of Bellman's equation, and we view a solution $\Phi r^*$ of this equation as an approximation to $J_\mu$. Note that $\Phi r^*$ depends only on the projection norm and the subspace $S$, and not on the matrix $\Phi$, which provides just an algebraic representation of $S$ (i.e., all matrices $\Phi$ whose range space is $S$ result in identical vectors $\Phi r^*$).

We know from Section 1.4 that $T$ is a contraction with respect to the sup-norm, but unfortunately this does not necessarily imply that $\Pi T$ is a contraction with respect to the norm $\|\cdot\|_v$. We will next show an important fact: if $v$ is chosen to be the steady-state probability vector $\xi$, then $\Pi T$ is a contraction with respect to $\|\cdot\|_v$, with modulus $\alpha$. The critical part of the proof is addressed in the following lemma.
Lemma 6.3.1: For any $n \times n$ stochastic matrix $P$ that has a steady-state probability vector $\xi = (\xi_1, \ldots, \xi_n)$ with positive components, we have
$$\|Pz\|_\xi \le \|z\|_\xi, \qquad z \in \Re^n.$$
Proof: Let $p_{ij}$ be the components of $P$. For all $z \in \Re^n$, we have
$$\|Pz\|_\xi^2 = \sum_{i=1}^n \xi_i\Big(\sum_{j=1}^n p_{ij}z_j\Big)^2 \le \sum_{i=1}^n \xi_i\sum_{j=1}^n p_{ij}z_j^2 = \sum_{j=1}^n\sum_{i=1}^n \xi_i p_{ij}z_j^2 = \sum_{j=1}^n \xi_j z_j^2 = \|z\|_\xi^2,$$
where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property $\sum_{i=1}^n \xi_i p_{ij} = \xi_j$ of the steady-state probabilities. Q.E.D.
We next note an important property of projections: they are nonexpansive, in the sense
$$\|\Pi J - \Pi\bar J\|_v \le \|J - \bar J\|_v, \qquad \text{for all } J, \bar J \in \Re^n.$$
To see this, note that by using the linearity of $\Pi$, we have
$$\|\Pi J - \Pi\bar J\|_v^2 = \big\|\Pi(J - \bar J)\big\|_v^2 \le \big\|\Pi(J - \bar J)\big\|_v^2 + \big\|(I - \Pi)(J - \bar J)\big\|_v^2 = \|J - \bar J\|_v^2,$$
where the rightmost equality follows from the Pythagorean Theorem
$$\|X\|_v^2 = \|\Pi X\|_v^2 + \|(I - \Pi)X\|_v^2, \qquad \text{for all } X \in \Re^n, \tag{6.38}$$
applied with $X = J - \bar J$. Thus, for $\Pi T$ to be a contraction with respect to $\|\cdot\|_v$, it is sufficient that $T$ be a contraction with respect to $\|\cdot\|_v$, since
$$\|\Pi TJ - \Pi T\bar J\|_v \le \|TJ - T\bar J\|_v \le \beta\|J - \bar J\|_v,$$
where $\beta$ is the modulus of contraction of $T$ with respect to $\|\cdot\|_v$ (see Fig. 6.3.1). This leads to the following proposition.

Proposition 6.3.1: The mappings $T$ and $\Pi T$ are contractions of modulus $\alpha$ with respect to the weighted Euclidean norm $\|\cdot\|_\xi$, where $\xi$ is the steady-state probability vector of the Markov chain.
Proof: We write $T$ in the form $TJ = g + \alpha PJ$, where $g$ is the vector with components $\sum_{j=1}^n p_{ij}g(i, j)$, $i = 1, \ldots, n$, and $P$ is the matrix with components $p_{ij}$. Then we have for all $J, \bar J \in \Re^n$,
$$TJ - T\bar J = \alpha P(J - \bar J).$$
We thus obtain
$$\|TJ - T\bar J\|_\xi = \alpha\|P(J - \bar J)\|_\xi \le \alpha\|J - \bar J\|_\xi,$$
where the inequality follows from Lemma 6.3.1. Hence $T$ is a contraction of modulus $\alpha$. The contraction property of $\Pi T$ follows from the contraction property of $T$ and the nonexpansiveness property of $\Pi$ noted earlier. Q.E.D.

The Pythagorean Theorem follows from the orthogonality of the vectors $\Pi X$ and $(I - \Pi)X$ in the scaled geometry of the norm $\|\cdot\|_v$.
Figure 6.3.1 Illustration of the contraction property of $\Pi T$ due to the nonexpansiveness of $\Pi$. If $T$ is a contraction with respect to $\|\cdot\|_v$, the Euclidean norm used in the projection, then $\Pi T$ is also a contraction with respect to that norm, since $\Pi$ is nonexpansive and we have
$$\|\Pi TJ - \Pi T\bar J\|_v \le \|TJ - T\bar J\|_v \le \beta\|J - \bar J\|_v,$$
where $\beta$ is the modulus of contraction of $T$ with respect to $\|\cdot\|_v$.
The next proposition gives an estimate of the error in estimating $J_\mu$ with the fixed point of $\Pi T$.

Proposition 6.3.2: Let $\Phi r^*$ be the fixed point of $\Pi T$. We have
$$\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1-\alpha^2}}\,\|J_\mu - \Pi J_\mu\|_\xi.$$

Proof: We have
$$\begin{aligned}
\|J_\mu - \Phi r^*\|_\xi^2 &= \|J_\mu - \Pi J_\mu\|_\xi^2 + \big\|\Pi J_\mu - \Phi r^*\big\|_\xi^2 \\
&= \|J_\mu - \Pi J_\mu\|_\xi^2 + \big\|\Pi TJ_\mu - \Pi T(\Phi r^*)\big\|_\xi^2 \\
&\le \|J_\mu - \Pi J_\mu\|_\xi^2 + \alpha^2\|J_\mu - \Phi r^*\|_\xi^2,
\end{aligned}$$
where the first equality uses the Pythagorean Theorem [cf. Eq. (6.38) with $X = J_\mu - \Phi r^*$], the second equality holds because $J_\mu$ is the fixed point of $T$ and $\Phi r^*$ is the fixed point of $\Pi T$, and the inequality uses the contraction property of $\Pi T$. From this relation, the result follows. Q.E.D.

Note the critical fact in the preceding analysis: $\alpha P$ (and hence $T$) is a contraction with respect to the projection norm $\|\cdot\|_\xi$ (cf. Lemma 6.3.1). Indeed, Props. 6.3.1 and 6.3.2 hold if $T$ is any (possibly nonlinear) contraction with respect to the Euclidean norm of the projection (cf. Fig. 6.3.1).
The Matrix Form of the Projected Bellman Equation
Let us now write the projected Bellman equation $\Phi r = \Pi T(\Phi r)$ in explicit form. We note that this is a linear equation, since the projection $\Pi$ is linear and also $T$ is linear of the form
$$TJ = g + \alpha PJ,$$
where $g$ is the vector with components $\sum_{j=1}^n p_{ij}g(i, j)$, $i = 1, \ldots, n$, and $P$ is the matrix with components $p_{ij}$. The solution of the projected Bellman equation is the vector $J = \Phi r^*$, where $r^*$ satisfies the orthogonality condition
$$\Phi'\Xi\big(\Phi r^* - (g + \alpha P\Phi r^*)\big) = 0, \tag{6.39}$$
with $\Xi$ being the diagonal matrix with the steady-state probabilities $\xi_1, \ldots, \xi_n$ along the diagonal [cf. Eq. (6.36)].

Thus the projected equation is written as
$$Cr^* = d, \tag{6.40}$$
where
$$C = \Phi'\Xi(I - \alpha P)\Phi, \qquad d = \Phi'\Xi g, \tag{6.41}$$
and can be solved by matrix inversion:
$$r^* = C^{-1}d,$$
just like the Bellman equation, which can also be solved by matrix inversion,
$$J = (I - \alpha P)^{-1}g.$$
An important difference is that the projected equation has smaller dimension ($s$ rather than $n$). Still, however, computing $C$ and $d$ using Eq. (6.41) requires computation of inner products of size $n$, so for problems where $n$ is very large, the explicit computation of $C$ and $d$ is impractical. We will discuss shortly efficient methods to compute inner products of large size by using simulation and low dimensional calculations. The idea is that an inner product, appropriately normalized, can be viewed as an expected value (the weighted sum of a large number of terms), which can be computed by sampling its components with an appropriate probability distribution and averaging the samples, as discussed in Section 6.1.5.
Here $\Phi r^*$ is the projection of $g + \alpha P\Phi r^*$, so $\Phi r^* - (g + \alpha P\Phi r^*)$ is orthogonal to the columns of $\Phi$. Alternatively, $r^*$ solves the problem
$$\min_{r\in\Re^s}\big\|\Phi r - (g + \alpha P\Phi r^*)\big\|_\xi^2.$$
Setting to 0 the gradient with respect to $r$ of the above quadratic expression, we obtain Eq. (6.39).
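To make the matrix form concrete, here is a small Python sketch (with made-up data, not from the text) that forms $\Xi$, $C$, and $d$ per Eq. (6.41) for a toy chain and solves $Cr^* = d$. This is viable only when $n$ is small, which is precisely the limitation that the simulation-based methods of Section 6.3.3 remove.

import numpy as np

# Toy data (hypothetical): n = 4 states, s = 2 features.
alpha = 0.9
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.1, 0.6, 0.3, 0.0],
              [0.0, 0.2, 0.5, 0.3],
              [0.2, 0.0, 0.3, 0.5]])
g_matrix = np.array([[1.0, 0.0, 2.0, 1.0],   # g_matrix[i, j] = g(i, j)
                     [0.5, 1.0, 0.0, 0.0],
                     [0.0, 2.0, 1.0, 0.5],
                     [1.0, 0.0, 0.5, 1.0]])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [1.0, 2.0]])

# Steady-state distribution xi: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)

g = (P * g_matrix).sum(axis=1)                      # g(i) = sum_j p_ij g(i, j)
C = Phi.T @ Xi @ (np.eye(4) - alpha * P) @ Phi      # cf. Eq. (6.41)
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)                      # solution of Eq. (6.40)
print(r_star)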
Figure 6.3.2 Illustration of the projected value iteration (PVI) method
$$\Phi r_{k+1} = \Pi T(\Phi r_k).$$
At the typical iteration $k$, the current iterate $\Phi r_k$ is operated on with $T$, and the generated vector $T(\Phi r_k)$ is projected onto $S$, to yield the new iterate $\Phi r_{k+1}$.
6.3.2 Projected Value Iteration - Other Iterative Methods
We noted in Chapter 1 that for problems where $n$ is very large, an iterative method such as value iteration may be appropriate for solving the Bellman equation $J = TJ$. Similarly, one may consider an iterative method for solving the projected Bellman equation $\Phi r = \Pi T(\Phi r)$ or its equivalent version $Cr = d$ [cf. Eqs. (6.40)-(6.41)].

Since $\Pi T$ is a contraction (cf. Prop. 6.3.1), the first iterative method that comes to mind is the analog of value iteration: successively apply $\Pi T$, starting with an arbitrary initial vector $\Phi r_0$:
$$\Phi r_{k+1} = \Pi T(\Phi r_k), \qquad k = 0, 1, \ldots. \tag{6.42}$$
Thus at iteration $k$, the current iterate $\Phi r_k$ is operated on with $T$, and the generated value iterate $T(\Phi r_k)$ (which does not necessarily lie in $S$) is projected onto $S$, to yield the new iterate $\Phi r_{k+1}$ (see Fig. 6.3.2). We refer to this as projected value iteration (PVI for short). Since $\Pi T$ is a contraction, it follows that the sequence $\{\Phi r_k\}$ generated by PVI converges to the unique fixed point $\Phi r^*$ of $\Pi T$.

It is possible to write PVI explicitly by noting that
$$r_{k+1} = \arg\min_{r\in\Re^s}\big\|\Phi r - (g + \alpha P\Phi r_k)\big\|_\xi^2.$$
By setting to 0 the gradient with respect to $r$ of the above quadratic expression, we obtain the orthogonality condition
$$\Phi'\Xi\big(\Phi r_{k+1} - (g + \alpha P\Phi r_k)\big) = 0,$$
[cf. Eq. (6.39)], which yields
$$r_{k+1} = r_k - (\Phi'\Xi\Phi)^{-1}(Cr_k - d), \tag{6.43}$$
where $C$ and $d$ are given by Eq. (6.41).

From the point of view of DP, the PVI method makes intuitive sense, and connects well with established DP theory. However, the methodology of iterative methods for solving linear equations suggests a much broader set of algorithmic possibilities. In particular, in a generic class of methods, the current iterate $r_k$ is corrected by the "residual" $Cr_k - d$ (which tends to 0), after "scaling" with some $s \times s$ scaling matrix $G$, leading to the iteration
$$r_{k+1} = r_k - \gamma G(Cr_k - d), \tag{6.44}$$
where $\gamma$ is a positive stepsize, and $G$ is some $s \times s$ scaling matrix. When $G = (\Phi'\Xi\Phi)^{-1}$ and $\gamma = 1$, we obtain the PVI method, but there are other interesting possibilities. For example when $G$ is the identity or a diagonal approximation to $(\Phi'\Xi\Phi)^{-1}$, the iteration (6.44) is simpler than PVI in that it does not require a matrix inversion (it does require, however, the choice of a stepsize $\gamma$).

The iteration (6.44) converges to the solution of the projected equation if and only if the matrix $I - \gamma GC$ has eigenvalues strictly within the unit circle. The following proposition shows that this is true when $G$ is positive definite symmetric, as long as the stepsize $\gamma$ is small enough to compensate for large components in the matrix $G$. This hinges on an important property of the matrix $C$, which we now define. Let us say that a (possibly nonsymmetric) $s \times s$ matrix $M$ is positive definite if
$$r'Mr > 0, \qquad \forall\ r \ne 0.$$
We say that $M$ is positive semidefinite if
$$r'Mr \ge 0, \qquad \forall\ r \in \Re^s.$$
The following proposition shows that $C$ is positive definite, and if $G$ is positive definite and symmetric, the iteration (6.44) is convergent for sufficiently small stepsize $\gamma$.

Iterative methods that involve incremental changes along directions of the form $Gf(x)$ are very common for solving a system of equations $f(x) = 0$. They arise prominently in cases where $f(x)$ is the gradient of a cost function, or has certain monotonicity properties. They also admit extensions to the case where there are constraints on $x$ (see [Ber09b], [Ber11a] for an analysis that is relevant to the present DP context).
Proposition 6.3.3: The matrix $C$ of Eq. (6.41) is positive definite. Furthermore, if the $s \times s$ matrix $G$ is symmetric and positive definite, there exists $\bar\gamma > 0$ such that the eigenvalues of
$$I - \gamma GC$$
lie strictly within the unit circle for all $\gamma \in (0, \bar\gamma]$.

For the proof we need the following lemma, which is attributed to Lyapunov (see Theorem 3.3.9, and Note 3.13.6 of Cottle, Pang, and Stone [CPS92]).
Lemma 6.3.2: The eigenvalues of a positive definite matrix have positive real parts.

Proof: Let $M$ be a positive definite matrix. Then for sufficiently small $\gamma > 0$ we have $(\gamma/2)r'M'Mr < r'Mr$ for all $r \ne 0$, or equivalently
$$\big\|(I - \gamma M)r\big\|^2 < \|r\|^2, \qquad \forall\ r \ne 0,$$
implying that $I - \gamma M$ is a contraction mapping with respect to the standard Euclidean norm. Hence the eigenvalues of $I - \gamma M$ lie within the unit circle. Since these eigenvalues are $1 - \gamma\lambda$, where $\lambda$ are the eigenvalues of $M$, it follows that if $M$ is positive definite, the eigenvalues of $M$ have positive real parts. Q.E.D.
Proof of Prop. 6.3.3: For all $r \in \Re^s$, we have
$$\|\Pi P\Phi r\|_\xi \le \|P\Phi r\|_\xi \le \|\Phi r\|_\xi, \tag{6.45}$$
where the first inequality follows from the Pythagorean Theorem,
$$\|P\Phi r\|_\xi^2 = \|\Pi P\Phi r\|_\xi^2 + \|(I - \Pi)P\Phi r\|_\xi^2,$$
and the second inequality follows from Prop. 6.3.1. Also from properties of projections, all vectors of the form $\Phi r$ are orthogonal to all vectors of the form $x - \Pi x$, i.e.,
$$r'\Phi'\Xi(I - \Pi)x = 0, \qquad \forall\ r \in \Re^s,\ x \in \Re^n, \tag{6.46}$$
[cf. Eq. (6.36)]. Thus, we have for all $r \ne 0$,
$$\begin{aligned}
r'Cr &= r'\Phi'\Xi(I - \alpha P)\Phi r \\
&= r'\Phi'\Xi\big(I - \alpha\Pi P + \alpha(\Pi - I)P\big)\Phi r \\
&= r'\Phi'\Xi(I - \alpha\Pi P)\Phi r \\
&= \|\Phi r\|_\xi^2 - \alpha r'\Phi'\Xi\Pi P\Phi r \\
&\ge \|\Phi r\|_\xi^2 - \alpha\|\Phi r\|_\xi\,\|\Pi P\Phi r\|_\xi \\
&\ge (1 - \alpha)\|\Phi r\|_\xi^2 \\
&> 0,
\end{aligned}$$
where the third equality follows from Eq. (6.46), the first inequality follows from the Cauchy-Schwarz inequality applied with inner product $\langle x, y\rangle = x'\Xi y$, and the second inequality follows from Eq. (6.45). This proves the positive definiteness of $C$.

If $G$ is symmetric and positive definite, the matrix $G^{1/2}$ exists and is symmetric and positive definite. Let $M = G^{1/2}CG^{1/2}$, and note that since $C$ is positive definite, $M$ is also positive definite, so from Lemma 6.3.2 it follows that its eigenvalues have positive real parts. The eigenvalues of $M$ and $GC$ are equal (with eigenvectors that are multiples of $G^{1/2}$ or $G^{-1/2}$ of each other), so the eigenvalues of $GC$ have positive real parts. It follows that the eigenvalues of $I - \gamma GC$ lie strictly within the unit circle for sufficiently small $\gamma > 0$. This completes the proof of Prop. 6.3.3. Q.E.D.
Note that for the conclusion of Prop. 6.3.3 to hold, it is not necessary that $G$ be symmetric. It is sufficient that $GC$ have eigenvalues with positive real parts. An example is $G = C'\Sigma^{-1}$, where $\Sigma$ is a positive definite symmetric matrix, in which case $GC = C'\Sigma^{-1}C$ is a positive definite matrix. Another example, which is important for our purposes as we will see later (cf. Section 6.3.4), is
$$G = (C'\Sigma^{-1}C + \beta I)^{-1}C'\Sigma^{-1}, \tag{6.47}$$
where $\Sigma$ is a positive definite symmetric matrix, and $\beta$ is a positive scalar. Then $GC$ is given by
$$GC = (C'\Sigma^{-1}C + \beta I)^{-1}C'\Sigma^{-1}C,$$
and can be shown to have real eigenvalues that lie in the interval (0, 1), even if $C$ is not positive definite. As a result, $I - \gamma GC$ has real eigenvalues within the unit circle for any $\gamma \in (0, 2)$.

To see this, let $\lambda_1, \ldots, \lambda_s$ be the eigenvalues of $C'\Sigma^{-1}C$ and let $U\Lambda U'$ be its singular value decomposition, where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_s\}$ and $U$ is a unitary matrix ($UU' = I$; see [Str09], [TrB97]). We also have $C'\Sigma^{-1}C + \beta I = U(\Lambda + \beta I)U'$, so
$$GC = \big(U(\Lambda + \beta I)U'\big)^{-1}U\Lambda U' = U(\Lambda + \beta I)^{-1}\Lambda U'.$$
It follows that the eigenvalues of $GC$ are $\lambda_i/(\lambda_i + \beta)$, $i = 1, \ldots, s$, and lie in the interval (0, 1). Actually, the iteration
$$r_{k+1} = r_k - \gamma G(Cr_k - d),$$
[cf. Eq. (6.44)], where $G$ is given by Eq. (6.47), is the so-called proximal point algorithm applied to the problem of minimizing $(Cr - d)'\Sigma^{-1}(Cr - d)$ over $r$. From known results about this algorithm (Martinet [Mar70] and Rockafellar [Roc76]) it follows that the iteration will converge to a minimizing point of $(Cr - d)'\Sigma^{-1}(Cr - d)$. Thus it will converge to some solution of the projected equation $Cr = d$, even if there exist many solutions (as in the case where $\Phi$ does not have rank $s$).
Unfortunately, however, while PVI and its scaled version (6.44) are conceptually important, they are not practical algorithms for problems where $n$ is very large. The reason is that the vector $T(\Phi r_k)$ is $n$-dimensional and its calculation is prohibitive for the type of large problems that we aim to address. Furthermore, even if $T(\Phi r_k)$ were calculated, its projection on $S$ requires knowledge of the steady-state probabilities $\xi_1, \ldots, \xi_n$, which are generally unknown. Fortunately, both of these difficulties can be dealt with through the use of simulation, as we discuss next.
6.3.3 Simulation-Based Methods
We will now consider approximate versions of the methods for solving the projected equation, which involve simulation and low-dimensional calculations. The idea is very simple: we collect simulation samples from the Markov chain associated with the policy, and we average them to form a matrix $C_k$ that approximates
$$C = \Phi'\Xi(I - \alpha P)\Phi,$$
and a vector $d_k$ that approximates
$$d = \Phi'\Xi g;$$
[cf. Eq. (6.41)]. We then approximate the solution $C^{-1}d$ of the projected equation with $C_k^{-1}d_k$, or we approximate the term $(Cr_k - d)$ in the PVI iteration (6.43) [or its scaled version (6.44)] with $(C_k r_k - d_k)$.
The simulation can be done as follows: we generate an infinitely long trajectory $(i_0, i_1, \ldots)$ of the Markov chain, starting from an arbitrary state $i_0$. After generating state $i_t$, we compute the corresponding row $\phi(i_t)'$ of $\Phi$, and after generating the transition $(i_t, i_{t+1})$, we compute the corresponding cost component $g(i_t, i_{t+1})$. After collecting $k+1$ samples ($k = 0, 1, \ldots$), we form
$$C_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(i_t)\big(\phi(i_t) - \alpha\phi(i_{t+1})\big)', \tag{6.48}$$
and
$$d_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(i_t)g(i_t, i_{t+1}), \tag{6.49}$$
where $\phi(i)'$ denotes the $i$th row of $\Phi$.
It can be proved using simple law of large numbers arguments that $C_k \to C$ and $d_k \to d$ with probability 1. To show this, we use the expression $\Phi' = \big[\phi(1)\ \cdots\ \phi(n)\big]$ to write $C$ explicitly as
$$C = \Phi'\Xi(I - \alpha P)\Phi = \sum_{i=1}^n \xi_i\,\phi(i)\Big(\phi(i) - \alpha\sum_{j=1}^n p_{ij}\phi(j)\Big)', \tag{6.50}$$
and we rewrite $C_k$ in a form that matches the preceding expression, except that the probabilities $\xi_i$ and $p_{ij}$ are replaced by corresponding empirical frequencies produced by the simulation. Indeed, by denoting $\delta(\cdot)$ the indicator function [$\delta(E) = 1$ if the event $E$ has occurred and $\delta(E) = 0$ otherwise], we have
$$C_k = \sum_{i=1}^n\sum_{j=1}^n\frac{\sum_{t=0}^k\delta(i_t = i,\,i_{t+1} = j)}{k+1}\,\phi(i)\big(\phi(i) - \alpha\phi(j)\big)' = \sum_{i=1}^n\frac{\sum_{t=0}^k\delta(i_t = i)}{k+1}\,\phi(i)\Big(\phi(i) - \alpha\sum_{j=1}^n\frac{\sum_{t=0}^k\delta(i_t = i,\,i_{t+1} = j)}{\sum_{t=0}^k\delta(i_t = i)}\,\phi(j)\Big)'$$
and finally
$$C_k = \sum_{i=1}^n\hat\xi_{i,k}\,\phi(i)\Big(\phi(i) - \alpha\sum_{j=1}^n\hat p_{ij,k}\,\phi(j)\Big)',$$
where
$$\hat\xi_{i,k} = \frac{\sum_{t=0}^k\delta(i_t = i)}{k+1}, \qquad \hat p_{ij,k} = \frac{\sum_{t=0}^k\delta(i_t = i,\,i_{t+1} = j)}{\sum_{t=0}^k\delta(i_t = i)}. \tag{6.51}$$
Here, $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ are the fractions of time that state $i$, or transition $(i, j)$, has occurred within $(i_0, \ldots, i_k)$, the initial $(k+1)$-state portion of the simulated trajectory. Since the empirical frequencies $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ asymptotically converge (with probability 1) to the probabilities $\xi_i$ and $p_{ij}$, respectively, we have with probability 1,
$$C_k \to \sum_{i=1}^n \xi_i\,\phi(i)\Big(\phi(i) - \alpha\sum_{j=1}^n p_{ij}\phi(j)\Big)' = \Phi'\Xi(I - \alpha P)\Phi = C,$$
[cf. Eq. (6.50)]. Similarly, we can write
$$d_k = \sum_{i=1}^n\hat\xi_{i,k}\,\phi(i)\sum_{j=1}^n\hat p_{ij,k}\,g(i, j),$$
and we have
$$d_k \to \sum_{i=1}^n\xi_i\,\phi(i)\sum_{j=1}^n p_{ij}\,g(i, j) = \Phi'\Xi g = d.$$
Note that from Eqs. (6.48)-(6.49), $C_k$ and $d_k$ can be updated in a manner reminiscent of stochastic iterative methods, as new samples $\phi(i_k)$ and $g(i_k, i_{k+1})$ are generated. In particular, we have
$$C_k = (1 - \gamma_k)C_{k-1} + \gamma_k\,\phi(i_k)\big(\phi(i_k) - \alpha\phi(i_{k+1})\big)', \qquad d_k = (1 - \gamma_k)d_{k-1} + \gamma_k\,\phi(i_k)g(i_k, i_{k+1}),$$
with the initial conditions $C_{-1} = 0$, $d_{-1} = 0$, and
$$\gamma_k = \frac{1}{k+1}, \qquad k = 0, 1, \ldots.$$
In these update formulas, $\gamma_k$ can be viewed as a stepsize, and indeed it can be shown that $C_k$ and $d_k$ converge to $C$ and $d$ for other choices of $\gamma_k$ (see [Yu10a,b]).
6.3.4 LSTD, LSPE, and TD(0) Methods
Given the simulation-based approximations $C_k$ and $d_k$ of Eqs. (6.48) and (6.49), one possibility is to construct a simulation-based approximate solution
$$\hat r_k = C_k^{-1}d_k. \tag{6.52}$$
This is known as the LSTD (least squares temporal differences) method. Despite the dependence on the index $k$, this is not an iterative method, since $\hat r_{k-1}$ is not needed to compute $\hat r_k$. Rather it may be viewed as an approximate matrix inversion approach: we replace the projected equation $Cr = d$ with the approximation $C_k r = d_k$, using a batch of $k+1$ simulation samples, and solve the approximate equation by matrix inversion. Note that by using Eqs. (6.48) and (6.49), the equation $C_k r = d_k$ can be written as
$$C_k\hat r_k - d_k = \frac{1}{k+1}\sum_{t=0}^k\phi(i_t)q_{k,t} = 0, \tag{6.53}$$
where
$$q_{k,t} = \phi(i_t)'\hat r_k - \alpha\phi(i_{t+1})'\hat r_k - g(i_t, i_{t+1}). \tag{6.54}$$
The scalar $q_{k,t}$ is the so-called temporal difference, associated with $\hat r_k$ and transition $(i_t, i_{t+1})$. It may be viewed as a sample of a residual term arising in the projected Bellman equation. More specifically, from Eqs. (6.40), (6.41), we have
$$C\hat r_k - d = \Phi'\Xi\big(\Phi\hat r_k - \alpha P\Phi\hat r_k - g\big). \tag{6.55}$$
The three terms in the definition (6.54) of the temporal difference $q_{k,t}$ can be viewed as samples [associated with the transition $(i_t, i_{t+1})$] of the corresponding three terms in the expression $\Xi\big(\Phi\hat r_k - \alpha P\Phi\hat r_k - g\big)$ in Eq. (6.55).
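A minimal Python sketch of LSTD follows, building $C_k$ and $d_k$ from one simulated trajectory exactly as in Eqs. (6.48)-(6.49) and then solving per Eq. (6.52); the simulator interface is hypothetical.

import numpy as np

def lstd(phi, sample_next, stage_cost, i0, alpha, num_transitions):
    # phi(i)          -> feature vector phi(i) (length s)
    # sample_next(i)  -> next state drawn according to p_ij (simulator)
    # stage_cost(i,j) -> g(i, j)
    # Forms C_k and d_k of Eqs. (6.48)-(6.49) and returns the LSTD
    # estimate r_hat = C_k^{-1} d_k of Eq. (6.52).
    s = len(phi(i0))
    C = np.zeros((s, s))
    d = np.zeros(s)
    i = i0
    for _ in range(num_transitions):
        j = sample_next(i)
        f_i, f_j = phi(i), phi(j)
        C += np.outer(f_i, f_i - alpha * f_j)
        d += f_i * stage_cost(i, j)
        i = j
    C /= num_transitions
    d /= num_transitions
    return np.linalg.solve(C, d)   # assumes C_k is nonsingular; see the discussion below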
Regression-Based LSTD
An important concern in LSTD is to ensure that the simulation-induced error
$$e_k = \Phi(\hat r_k - r^*) = \Phi(C_k^{-1}d_k - C^{-1}d)$$
is not excessively large. If $e_k$ is large, then the low-dimensional error $C_k^{-1}d_k - C^{-1}d$ is typically also large (the reverse is not true: $\hat r_k - r^*$ may be large without $e_k$ being large). In the lookup table representation case ($\Phi = I$) a large error $e_k$ may be traced directly to the simulation error in evaluating $C$ and $d$, combined with near singularity of $\Xi(I - \alpha P)$. In the compact representation case ($\Phi \ne I$), the effect of near singularity of $C$ on the high-dimensional error $e_k$ is more complex, but is also primarily due to the same causes. In what follows we will consider approaches to reduce the low-dimensional error $\hat r_k - r^*$ with the understanding that these approaches will also be effective in reducing the high-dimensional error $e_k$, when the latter is very large.
Near-singularity of $C$, causing large low-dimensional errors $C_k^{-1}d_k - C^{-1}d$, may be due either to the columns of $\Phi$ being nearly linearly dependent or to the matrix $\Xi(I - \alpha P)$ being nearly singular [cf. the formula $C = \Phi'\Xi(I - \alpha P)\Phi$ of Eq. (6.41)]. However, near-linear dependence of the columns of $\Phi$ will not affect the high-dimensional error $e_k$. The reason is that $e_k$ depends only on the subspace $S$ and not its representation in terms of the matrix $\Phi$. In particular, if we replace $\Phi$ with a matrix $\Phi B$ where $B$ is an $s \times s$ invertible scaling matrix, the subspace $S$ will be unaffected and the errors $e_k$ will also be unaffected, as can be verified using the formulas of Section 6.3.3. On the other hand, near singularity of the matrix $I - \alpha P$ may affect significantly $e_k$. Note that $I - \alpha P$ is nearly singular in the case where $\alpha$ is very close to 1, or in the corresponding undiscounted case where $\alpha = 1$ and $P$ is substochastic with eigenvalues very close to 1 (see Section 6.6). Large variations in the size of the diagonal components of $\Xi$ may also affect significantly $e_k$, although this dependence is complicated by the fact that $\Xi$ appears not only in the formula $C = \Phi'\Xi(I - \alpha P)\Phi$ but also in the formula $d = \Phi'\Xi g$.
Example 6.3.1
To get a rough sense of the potential effect of the simulation error in LSTD, consider the approximate inversion of a small nonzero number $c$, which is estimated with simulation error $\epsilon$. The absolute and relative errors are
$$E = \frac{1}{c+\epsilon} - \frac{1}{c}, \qquad E_r = \frac{E}{1/c}.$$
By a first order Taylor series expansion around $\epsilon = 0$, we obtain for small $\epsilon$
$$E \approx \frac{\partial\big(1/(c+\epsilon)\big)}{\partial\epsilon}\bigg|_{\epsilon=0}\,\epsilon = -\frac{\epsilon}{c^2}, \qquad E_r \approx -\frac{\epsilon}{c}.$$
Thus for the estimate $\frac{1}{c+\epsilon}$ to be reliable, we must have $|\epsilon| \ll |c|$. If $N$ independent samples are used to estimate $c$, the variance of $\epsilon$ is proportional to $1/N$, so for a small relative error, $N$ must be much larger than $1/c^2$. Thus as $c$ approaches 0, the amount of sampling required for reliable simulation-based inversion increases very fast.
To reduce the size of the errors $\hat r_k - r^*$, an effective remedy is to estimate $r^*$ by a form of regularized regression, which works even if $C_k$ is singular, at the expense of a systematic/deterministic error (a "bias") in the generated estimate. In this approach, instead of solving the system $C_k r = d_k$, we use a least-squares fit of a linear model that properly encodes the effect of the simulation noise.

We write the projected form of Bellman's equation $d = Cr$ as
$$d_k = C_k r + e_k, \tag{6.56}$$
where $e_k$ is the vector
$$e_k = (C - C_k)r + d_k - d,$$
which we view as "simulation noise." We then estimate the solution $r^*$ based on Eq. (6.56) by using regression. In particular, we choose $r$ by solving the least squares problem:
$$\min_r\Big\{(d_k - C_k r)'\Sigma^{-1}(d_k - C_k r) + \beta\|r - \bar r\|^2\Big\}, \tag{6.57}$$
where $\bar r$ is an a priori estimate of $r^*$, $\Sigma$ is some positive definite symmetric matrix, and $\beta$ is a positive scalar. By setting to 0 the gradient of the least squares objective in Eq. (6.57), we can find the solution in closed form:
$$\hat r_k = (C_k'\Sigma^{-1}C_k + \beta I)^{-1}(C_k'\Sigma^{-1}d_k + \beta\bar r). \tag{6.58}$$
A suitable choice of $\bar r$ may be some heuristic guess based on intuition about the problem, or it may be the parameter vector corresponding to the estimated cost vector $\Phi\bar r$ of a similar policy (for example a preceding policy in an approximate policy iteration context). One may try to choose $\Sigma$ in special ways to enhance the quality of the estimate of $r^*$, but we will not consider this issue here, and the subsequent analysis in this section does not depend on the choice of $\Sigma$, as long as it is positive definite and symmetric.

The quadratic $\beta\|r - \bar r\|^2$ in Eq. (6.57) is known as a regularization term, and has the effect of "biasing" the estimate $\hat r_k$ towards the a priori guess $\bar r$. The proper size of $\beta$ is not clear (a large size reduces the effect of near singularity of $C_k$, and the effect of the simulation errors $C_k - C$ and $d_k - d$, but may also cause a large "bias"). However, this is typically not a major difficulty in practice, because trial-and-error experimentation with different values of $\beta$ involves low-dimensional linear algebra calculations once $C_k$ and $d_k$ become available.
We will now derive an estimate for the error $\hat r_k - r^*$, where $r^* = C^{-1}d$ is the solution of the projected equation. Let us denote
$$b_k = \Sigma^{-1/2}(d_k - C_k r^*),$$
so from Eq. (6.58),
$$\hat r_k - r^* = (C_k'\Sigma^{-1}C_k + \beta I)^{-1}\big(C_k'\Sigma^{-1/2}b_k + \beta(\bar r - r^*)\big). \tag{6.59}$$
We have the following proposition, which involves the singular values of the matrix $\Sigma^{-1/2}C_k$ (these are the square roots of the eigenvalues of $C_k'\Sigma^{-1}C_k$; see e.g., [Str09], [TrB97]).
Proposition 6.3.4: We have
$$\|\hat r_k - r^*\| \le \max_{i=1,\ldots,s}\left\{\frac{\lambda_i}{\lambda_i^2 + \beta}\right\}\|b_k\| + \max_{i=1,\ldots,s}\left\{\frac{\beta}{\lambda_i^2 + \beta}\right\}\|\bar r - r^*\|, \tag{6.60}$$
where $\lambda_1, \ldots, \lambda_s$ are the singular values of $\Sigma^{-1/2}C_k$.
Proof: Let $\Sigma^{-1/2}C_k = U\Lambda V'$ be the singular value decomposition of $\Sigma^{-1/2}C_k$, where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_s\}$, and $U$, $V$ are unitary matrices ($UU' = VV' = I$ and $\|U\| = \|U'\| = \|V\| = \|V'\| = 1$; see [Str09], [TrB97]). Then, Eq. (6.59) yields
$$\begin{aligned}
\hat r_k - r^* &= (V\Lambda U'U\Lambda V' + \beta I)^{-1}\big(V\Lambda U'b_k + \beta(\bar r - r^*)\big) \\
&= (V')^{-1}(\Lambda^2 + \beta I)^{-1}V^{-1}\big(V\Lambda U'b_k + \beta(\bar r - r^*)\big) \\
&= V(\Lambda^2 + \beta I)^{-1}\Lambda U'b_k + \beta V(\Lambda^2 + \beta I)^{-1}V'(\bar r - r^*).
\end{aligned}$$
Figure 6.3.3 Illustration of Prop. 6.3.4. The figure shows the estimates
$$\hat r_k = \big(C_k'\Sigma^{-1}C_k + \beta I\big)^{-1}\big(C_k'\Sigma^{-1}d_k + \beta\bar r\big)$$
corresponding to a finite number of samples, and the exact values
$$\big(C'\Sigma^{-1}C + \beta I\big)^{-1}\big(C'\Sigma^{-1}d + \beta\bar r\big)$$
corresponding to an infinite number of samples. We may view $\hat r_k - r^*$ as the sum of a "simulation error" (the difference between $\hat r_k$ and the latter vector) whose norm is bounded by the first term in the estimate (6.60) and can be made arbitrarily small by sufficiently long sampling, and a "regularization error" (the difference between that vector and $r^*$) whose norm is bounded by the second term in the right-hand side of Eq. (6.60).
Therefore, using the triangle inequality, we have
$$\begin{aligned}
\|\hat r_k - r^*\| &\le \|V\|\max_{i=1,\ldots,s}\left\{\frac{\lambda_i}{\lambda_i^2+\beta}\right\}\|U'\|\,\|b_k\| + \beta\|V\|\max_{i=1,\ldots,s}\left\{\frac{1}{\lambda_i^2+\beta}\right\}\|V'\|\,\|\bar r - r^*\| \\
&= \max_{i=1,\ldots,s}\left\{\frac{\lambda_i}{\lambda_i^2+\beta}\right\}\|b_k\| + \max_{i=1,\ldots,s}\left\{\frac{\beta}{\lambda_i^2+\beta}\right\}\|\bar r - r^*\|.
\end{aligned}$$
Q.E.D.
From Eq. (6.60), we see that the error $\|\hat r_k - r^*\|$ is bounded by the sum of two terms. The first term can be made arbitrarily small by using a sufficiently large number of samples, thereby making $\|b_k\|$ small. The second term reflects the bias introduced by the regularization and diminishes with $\beta$, but it cannot be made arbitrarily small by using more samples (see Fig. 6.3.3).

Now consider the case where $\beta = 0$, $\Sigma$ is the identity, and $C_k$ is invertible. Then $\hat r_k$ is the LSTD solution $C_k^{-1}d_k$, and the proof of Prop. 6.3.4 can be replicated to show that
$$\|\hat r_k - r^*\| \le \max_{i=1,\ldots,s}\left\{\frac{1}{\lambda_i}\right\}\|b_k\|,$$
where $\lambda_1, \ldots, \lambda_s$ are the (positive) singular values of $C_k$. This suggests that without regularization, the LSTD error can be adversely affected by near singularity of the matrix $C_k$ (smallest $\lambda_i$ close to 0). Thus we expect that for a nearly singular matrix $C$, a very large number of samples are necessary to attain a small error $(\hat r_k - r^*)$, with serious difficulties potentially resulting, consistent with the scalar inversion example we gave earlier.

We also note an alternative and somewhat simpler regularization approach, whereby we approximate the equation $C_k r = d_k$ by
$$(C_k + \beta I)\hat r = d_k + \beta\bar r, \tag{6.61}$$
where $\beta$ is a positive scalar and $\bar r$ is some guess of the solution $r^* = C^{-1}d$. We refer to [Ber11a] for more details on this method.

Generally, the regularization of LSTD alleviates the effects of near singularity of $C$ and simulation error, but it comes at a price: there is a bias of the estimate $\hat r_k$ towards the prior guess $\bar r$ (cf. Fig. 6.3.3). One possibility to eliminate this bias is to adopt an iterative regularization approach: start with some $\bar r$, obtain $\hat r_k$, replace $\bar r$ by $\hat r_k$, and repeat for any number of times. This turns LSTD to an iterative method, which will be shown to be a special case of the class of iterative LSPE-type methods to be discussed later.
LSPE Method
We will now develop a simulation-based implementation of the PVI iteration
$$\Phi r_{k+1} = \Pi T(\Phi r_k).$$
By expressing the projection as a least squares minimization, we see that $r_{k+1}$ is given by
$$r_{k+1} = \arg\min_{r\in\Re^s}\big\|\Phi r - T(\Phi r_k)\big\|_\xi^2,$$
or equivalently
$$r_{k+1} = \arg\min_{r\in\Re^s}\sum_{i=1}^n\xi_i\Big(\phi(i)'r - \sum_{j=1}^n p_{ij}\big(g(i, j) + \alpha\phi(j)'r_k\big)\Big)^2. \tag{6.62}$$
We approximate this optimization by generating an infinitely long trajectory $(i_0, i_1, \ldots)$ and by updating $r_k$ after each transition $(i_k, i_{k+1})$ according to
$$r_{k+1} = \arg\min_{r\in\Re^s}\sum_{t=0}^k\big(\phi(i_t)'r - g(i_t, i_{t+1}) - \alpha\phi(i_{t+1})'r_k\big)^2. \tag{6.63}$$
We call this iteration least squares policy evaluation (LSPE for short).
The similarity of PVI [Eq. (6.62)] and LSPE [Eq. (6.63)] can be seen by explicitly calculating the solutions of the associated least squares problems. For PVI, by setting the gradient of the cost function in Eq. (6.62) to 0 and using a straightforward calculation, we have
$$r_{k+1} = \left(\sum_{i=1}^n\xi_i\,\phi(i)\phi(i)'\right)^{-1}\left(\sum_{i=1}^n\xi_i\,\phi(i)\sum_{j=1}^n p_{ij}\big(g(i, j) + \alpha\phi(j)'r_k\big)\right). \tag{6.64}$$
For LSPE, we similarly have from Eq. (6.63)
$$r_{k+1} = \left(\sum_{t=0}^k\phi(i_t)\phi(i_t)'\right)^{-1}\left(\sum_{t=0}^k\phi(i_t)\big(g(i_t, i_{t+1}) + \alpha\phi(i_{t+1})'r_k\big)\right). \tag{6.65}$$
This equation can equivalently be written as
$$r_{k+1} = \left(\sum_{i=1}^n\hat\xi_{i,k}\,\phi(i)\phi(i)'\right)^{-1}\left(\sum_{i=1}^n\hat\xi_{i,k}\,\phi(i)\sum_{j=1}^n\hat p_{ij,k}\big(g(i, j) + \alpha\phi(j)'r_k\big)\right), \tag{6.66}$$
where $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ are empirical frequencies of state $i$ and transition $(i, j)$, defined by
$$\hat\xi_{i,k} = \frac{\sum_{t=0}^k\delta(i_t = i)}{k+1}, \qquad \hat p_{ij,k} = \frac{\sum_{t=0}^k\delta(i_t = i,\,i_{t+1} = j)}{\sum_{t=0}^k\delta(i_t = i)}. \tag{6.67}$$
(We will discuss later the question of existence of the matrix inverses in the preceding equations.) Here, $\delta(\cdot)$ denotes the indicator function [$\delta(E) = 1$ if the event $E$ has occurred and $\delta(E) = 0$ otherwise], so for example, $\hat\xi_{i,k}$ is the fraction of time that state $i$ has occurred within $(i_0, \ldots, i_k)$, the initial $(k+1)$-state portion of the simulated trajectory. By comparing Eqs. (6.64) and (6.66), we see that they asymptotically coincide, since the empirical frequencies $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ asymptotically converge (with probability 1) to the probabilities $\xi_i$ and $p_{ij}$, respectively.

Thus, LSPE may be viewed as PVI with simulation error added in the right-hand side (see Fig. 6.3.3). Since the empirical frequencies $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ converge to the probabilities $\xi_i$ and $p_{ij}$, the error asymptotically diminishes to 0 (assuming the iterates produced by LSPE are bounded). Because of this diminishing nature of the error and the contraction property of $\Pi T$, it is intuitively clear and can be rigorously shown that LSPE converges to the same limit as PVI. The limit is the unique $r^*$ satisfying the equation
$$\Phi r^* = \Pi T(\Phi r^*)$$
[cf. Eq. (6.37)], and the error estimate of Prop. 6.3.2 applies. LSPE may also be viewed as a special case of the class of simulation-based versions of the deterministic iterative methods of Section 6.3.2, which we discuss next.
Other Iterative Simulation-Based Methods
An alternative to LSTD is to use a true iterative method to solve the projected equation $Cr = d$ using simulation-based approximations to $C$ and $d$. One possibility is to approximate the scaled PVI iteration [cf. Eq. (6.44)]
$$r_{k+1} = r_k - \gamma G(Cr_k - d) \tag{6.68}$$
with
$$r_{k+1} = r_k - \gamma\hat G(\hat C r_k - \hat d), \tag{6.69}$$
where $\hat C$ and $\hat d$ are simulation-based estimates of $C$ and $d$, $\gamma$ is a positive stepsize, and $\hat G$ is an $s \times s$ matrix, which may also be obtained by simulation. Assuming that $I - \gamma\hat G\hat C$ is a contraction, this iteration will yield a solution to the system $\hat C r = \hat d$, which will serve as a simulation-based approximation to a solution of the projected equation $Cr = d$.
Like LSTD, this may be viewed as a batch simulation approach: we first simulate to obtain $\hat C$, $\hat d$, and $\hat G$, and then solve the system $\hat C r = \hat d$ by the iteration (6.69) rather than direct matrix inversion. An alternative is to iteratively update $r$ as simulation samples are collected and used to form ever improving approximations to $C$ and $d$. In particular, one or more iterations of the form (6.69) may be performed after collecting a few additional simulation samples that are used to improve the approximations of the current $\hat C$ and $\hat d$. In the most extreme type of such an algorithm, the iteration (6.69) is used after a single new sample is collected. This algorithm has the form
\[
r_{k+1} = r_k - \gamma G_k(C_k r_k - d_k), \tag{6.70}
\]
where $G_k$ is an $s \times s$ matrix, $\gamma$ is a positive stepsize, and $C_k$ and $d_k$ are given by Eqs. (6.48)-(6.49). For the purposes of further discussion, we will focus on this algorithm, with the understanding that there are related versions that use (partial) batch simulation and have similar properties. Note that the iteration (6.70) may also be written in terms of temporal differences as
\[
r_{k+1} = r_k - \frac{\gamma}{k+1}\,G_k\sum_{t=0}^k \phi(i_t)\,q_{k,t} \tag{6.71}
\]
[cf. Eqs. (6.48), (6.49), (6.54)]. The convergence behavior of this method is satisfactory. Generally, we have $r_k \to r^*$, provided $C_k \to C$, $d_k \to d$, and $G_k \to G$, where $G$ and $\gamma$ are such that $I - \gamma GC$ is a contraction [this is fairly evident in view of the convergence of the iteration (6.68), which was shown in Section 6.3.2; see also the papers [Ber09b], [Ber11a]].
To ensure that $I - \gamma GC$ is a contraction for small $\gamma$, we may choose $G$ to be symmetric and positive definite, or to have a special form, such as
\[
G = (C'\Sigma^{-1}C + \beta I)^{-1}C'\Sigma^{-1},
\]
where $\Sigma$ is any positive definite symmetric matrix, and $\beta$ is a positive scalar [cf. Eq. (6.47)].
Regarding the choices of $\gamma$ and $G_k$, one possibility is to choose $\gamma = 1$ and $G_k$ to be a simulation-based approximation to $G = (\Phi'\Xi\Phi)^{-1}$, which is used in the PVI method (6.42)-(6.43):
\[
G_k = \Bigg( \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\phi(i_t)' \Bigg)^{-1}, \tag{6.72}
\]
or
\[
G_k = \Bigg( \frac{\beta}{k+1}\,I + \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\phi(i_t)' \Bigg)^{-1}, \tag{6.73}
\]
where $\beta I$ is a positive multiple of the identity (to ensure that $G_k$ is positive definite). Note that when $\gamma = 1$ and $G_k$ is given by Eq. (6.72), the iteration (6.70) is identical to the LSPE iteration (6.65) [cf. the forms of $C_k$ and $d_k$ given by Eqs. (6.48) and (6.49)].
While $G_k$, as defined by Eqs. (6.72) and (6.73), requires updating and inversion at every iteration, a partial batch mode of updating $G_k$ is also possible: one may introduce into iteration (6.70) a new estimate of $G = (\Phi'\Xi\Phi)^{-1}$ periodically, obtained from the previous estimate using multiple simulation samples. This will save some computation and will not affect the asymptotic convergence rate of the method, as we will discuss shortly. Indeed, as noted earlier, the iteration (6.70) itself may be executed in partial batch mode, after collecting multiple samples between iterations. Note also that even if $G_k$ is updated at every $k$ using Eqs. (6.72) and (6.73), the updating can be done recursively; for example, from Eq. (6.72) we have
\[
G_k^{-1} = \frac{k}{k+1}\,G_{k-1}^{-1} + \frac{1}{k+1}\,\phi(i_k)\phi(i_k)'.
\]
A simple possibility is to use a diagonal matrix $G_k$, thereby simplifying the matrix inversion in the iteration (6.70). One possible choice is a diagonal approximation to $\Phi'\Xi\Phi$, obtained by discarding the off-diagonal terms of the matrix (6.72) or (6.73). Then it is reasonable to expect that a stepsize $\gamma$ close to 1 will often lead to $I - \gamma GC$ being a contraction, thereby facilitating the choice of $\gamma$. The simplest possibility is to just choose $G_k$ to be the identity, although in this case, some experimentation is needed to find a proper value of $\gamma$ such that $I - \gamma C$ is a contraction.
Another choice of $G_k$ is
\[
G_k = (C_k'\Sigma_k^{-1}C_k + \beta I)^{-1}C_k'\Sigma_k^{-1}, \tag{6.74}
\]
where $\Sigma_k$ is some positive definite symmetric matrix, and $\beta$ is a positive scalar. Then the iteration (6.70) takes the form
\[
r_{k+1} = r_k - \gamma(C_k'\Sigma_k^{-1}C_k + \beta I)^{-1}C_k'\Sigma_k^{-1}(C_k r_k - d_k),
\]
and for $\gamma = 1$, it can be written as
\[
r_{k+1} = (C_k'\Sigma_k^{-1}C_k + \beta I)^{-1}(C_k'\Sigma_k^{-1}d_k + \beta r_k). \tag{6.75}
\]
We recognize this as an iterative version of the regression-based LSTD method (6.58), where the prior guess $\bar r$ is replaced by the previous iterate $r_k$. This iteration is convergent to $r^*$ provided that $\{\Sigma_k^{-1}\}$ is bounded [$\gamma = 1$ is within the range of stepsizes for which $I - \gamma GC$ is a contraction; see the discussion following Eq. (6.47)].
A simpler regularization-based choice of $G_k$ is
\[
G_k = (C_k + \beta I)^{-1}
\]
[cf. Eq. (6.61)]. Then the iteration (6.70) takes the form
\[
r_{k+1} = r_k - \gamma(C_k + \beta I)^{-1}(C_k r_k - d_k). \tag{6.76}
\]
The convergence of this iteration can be proved, assuming that $C$ is positive definite, based on the fact $C_k \to C$ and standard convergence results for the proximal point algorithm ([Mar70], [Roc76]); see also [Ber11a]. Note that by contrast with Eq. (6.75), the positive definiteness of $C$ is essential for invertibility of $C_k + \beta I$ and for convergence of Eq. (6.76).
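The following minimal sketch, in Python, illustrates the regularized iteration (6.76) for fixed estimates of $C$ and $d$; the particular matrices $\hat C$, $\hat d$ and the values of $\gamma$ and $\beta$ are illustrative assumptions, not data from the text.

    import numpy as np

    # Minimal sketch of the regularized iteration (6.76):
    #   r_{k+1} = r_k - gamma (C_hat + beta I)^{-1} (C_hat r_k - d_hat),
    # applied with fixed estimates C_hat, d_hat (e.g., the final C_k, d_k of a run).
    def regularized_iteration(C_hat, d_hat, r0, gamma=1.0, beta=0.1, num_iter=200):
        s = len(d_hat)
        r = np.array(r0, dtype=float)
        for _ in range(num_iter):
            r = r - gamma * np.linalg.solve(C_hat + beta * np.eye(s),
                                            C_hat.dot(r) - d_hat)
        return r

    # Example with an illustrative 2x2 matrix whose symmetric part is positive definite
    C_hat = np.array([[2.0, 0.5], [0.3, 1.5]])
    d_hat = np.array([1.0, -0.5])
    print(regularized_iteration(C_hat, d_hat, r0=np.zeros(2)))
    print(np.linalg.solve(C_hat, d_hat))    # the target C_hat^{-1} d_hat

In this sketch the iterates approach $\hat C^{-1}\hat d$, consistent with the proximal interpretation mentioned above.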
Convergence Rate of Iterative Methods - Comparison with LSTD
Let us now discuss the choice of $\gamma$ and $G$ from the convergence rate point of view. It can be easily verified with simple examples that the values of $\gamma$ and $G$ affect significantly the convergence rate of the deterministic scaled PVI iteration (6.68). Surprisingly, however, the asymptotic convergence rate of the simulation-based iteration (6.70) does not depend on the choices of $\gamma$ and $G$. Indeed it can be proved that the iteration (6.70) converges at the same rate asymptotically, regardless of the choices of $\gamma$ and $G$, as long as $I - \gamma GC$ is a contraction (although the short-term convergence rate may be significantly affected by the choices of $\gamma$ and $G$).
The reason is that the scaled PVI iteration (6.68) has a linear convergence rate (since it involves a contraction), which is fast relative to the slow convergence rate of the simulation-generated $G_k$, $C_k$, and $d_k$. Thus the simulation-based iteration (6.70) operates on two time scales (see, e.g., Borkar [Bor08], Ch. 6): the slow time scale at which $G_k$, $C_k$, and $d_k$ change, and the fast time scale at which $r_k$ adapts to changes in $G_k$, $C_k$, and $d_k$. As a result, essentially, there is convergence in the fast time scale before there is appreciable change in the slow time scale. Roughly speaking, $r_k$ "sees" $G_k$, $C_k$, and $d_k$ as effectively constant, so that for large $k$, $r_k$ is essentially equal to the corresponding limit of iteration (6.70) with $G_k$, $C_k$, and $d_k$ held fixed. This limit is $C_k^{-1}d_k$. It follows that the sequence $r_k$ generated
by the scaled LSPE iteration (6.70) "tracks" the sequence $C_k^{-1}d_k$ generated by the LSTD iteration in the sense that
\[
\|r_k - C_k^{-1}d_k\| \ll \|r_k - r^*\|, \qquad \text{for large } k,
\]
independent of the choice of $\gamma$ and the scaling matrix $G$ that is approximated by $G_k$ (see also [YuB06b], [Ber09b], [Ber11a] for analysis and further discussion).
TD(0) Method
This is an iterative method for solving the projected equation $Cr = d$. Like LSTD and LSPE, it generates an infinitely long trajectory $i_0, i_1, \ldots$ of the Markov chain, but at each iteration, it uses only one sample, the last one. It has the form
\[
r_{k+1} = r_k - \gamma_k\,\phi(i_k)\,q_{k,k}, \tag{6.77}
\]
where $\gamma_k$ is a stepsize sequence that diminishes to 0. It may be viewed as an instance of a classical stochastic approximation scheme for solving the projected equation $Cr = d$. This equation can be written as $\Phi'\Xi(\Phi r - A\Phi r - b) = 0$, and by using Eqs. (6.54) and (6.77), it can be seen that the direction of change $\phi(i_k)q_{k,k}$ in TD(0) is a sample of the left-hand side $\Phi'\Xi(\Phi r - A\Phi r - b)$ of the equation.
Let us note a similarity between TD(0) and the scaled LSPE method (6.71) with $G_k = I$, given by:
\[
r_{k+1} = r_k - \gamma(C_k r_k - d_k) = r_k - \frac{\gamma}{k+1}\sum_{t=0}^k \phi(i_t)\,q_{k,t}. \tag{6.78}
\]
While LSPE uses as direction of change a time-average approximation of $Cr_k - d$ based on all the available samples, TD(0) uses a single sample approximation. It is thus not surprising that TD(0) is a much slower algorithm than LSPE, and moreover requires that the stepsize $\gamma_k$ diminishes to 0 in order to deal with the nondiminishing noise that is inherent in the term $\phi(i_k)q_{k,k}$ of Eq. (6.77). On the other hand, TD(0) requires much less overhead per iteration: calculating the single temporal difference $q_{k,k}$ and multiplying it with $\phi(i_k)$, rather than updating the $s \times s$ matrix $C_k$ and multiplying it with $r_k$. Thus when $s$, the number of features, is very large, TD(0) may offer a significant overhead advantage over LSTD and LSPE.
We finally note a scaled version of TD(0) given by
\[
r_{k+1} = r_k - \gamma_k\,G_k\,\phi(i_k)\,q_{k,k}, \tag{6.79}
\]
where $G_k$ is a positive definite symmetric scaling matrix, selected to speed up convergence. It is a scaled (by the matrix $G_k$) version of TD(0), so it may be viewed as a type of scaled stochastic approximation method.
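The following is a minimal runnable sketch, in Python, of the basic TD(0) iteration (6.77); the chain, costs, features, discount factor, and stepsize schedule are illustrative assumptions, chosen only to show the mechanics of the update.

    import numpy as np

    # Minimal sketch of TD(0), Eq. (6.77):
    #   r_{k+1} = r_k - gamma_k * phi(i_k) * q_{k,k},
    # with q_{k,k} = phi(i_k)'r_k - alpha*phi(i_{k+1})'r_k - g(i_k, i_{k+1}).
    np.random.seed(1)
    n, alpha = 3, 0.9
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])
    g = np.array([[1.0, 2.0, 0.0],
                  [0.5, 1.0, 1.5],
                  [2.0, 0.0, 1.0]])
    Phi = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [1.0, 3.0]])

    r = np.zeros(2)
    i = 0
    for k in range(200000):
        j = np.random.choice(n, p=P[i])
        q = Phi[i].dot(r) - alpha * Phi[j].dot(r) - g[i, j]   # temporal difference
        gamma_k = 10.0 / (k + 100)                            # diminishing stepsize
        r = r - gamma_k * Phi[i] * q
        i = j
    print("TD(0) estimate r =", r)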
6.3.5 Optimistic Versions
In the LSTD and LSPE methods discussed so far, the underlying assumption is that each policy is evaluated with a very large number of samples, so that an accurate approximation of $C$ and $d$ are obtained. There are also optimistic versions (cf. Section 6.1.2), where the policy $\mu$ is replaced by an "improved" policy $\bar\mu$ after only a certain number of simulation samples have been processed.
A natural form of optimistic LSTD is $r_{k+1} = C_k^{-1}d_k$, where $C_k$ and $d_k$ are obtained by averaging samples collected using the controls corresponding to the (approximately) improved policy. By this we mean that $C_k$ and $d_k$ are time averages of the matrices and vectors
\[
\phi(i_t)\big(\phi(i_t) - \alpha\phi(i_{t+1})\big)', \qquad \phi(i_t)\,g(i_t, i_{t+1}),
\]
corresponding to simulated transitions $(i_t, i_{t+1})$ that are generated using the policy $\mu_{k+1}$ whose controls are given by
\[
\mu_{k+1}(i) \in \arg\min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\big( g(i, u, j) + \alpha\,\phi(j)'r_k \big)
\]
[cf. Eq. (6.52)]. Unfortunately, this method requires the collection of many samples between policy updates, as it is susceptible to simulation noise in $C_k$ and $d_k$.
The optimistic version of (scaled) LSPE is based on similar ideas. Following the state transition $(i_k, i_{k+1})$, we update $r_k$ using the iteration
\[
r_{k+1} = r_k - \gamma G_k(C_k r_k - d_k), \tag{6.80}
\]
where $C_k$ and $d_k$ are given by Eqs. (6.48), (6.49) [cf. Eq. (6.70)], and $G_k$ is a scaling matrix that converges to some $G$ for which $I - \gamma GC$ is a contraction. For example $G_k$ could be a positive definite symmetric matrix [such as for example the one given by Eq. (6.72)] or the matrix
\[
G_k = (C_k'\Sigma_k^{-1}C_k + \beta I)^{-1}C_k'\Sigma_k^{-1} \tag{6.81}
\]
[cf. Eq. (6.74)]. In the latter case, for $\gamma = 1$ the method takes the form
\[
r_{k+1} = (C_k'\Sigma_k^{-1}C_k + \beta I)^{-1}(C_k'\Sigma_k^{-1}d_k + \beta r_k) \tag{6.82}
\]
[cf. Eq. (6.75)]. The simulated transitions are generated using a policy that is updated every few samples. In the extreme case of a single sample between policies, we generate the next transition $(i_{k+1}, i_{k+2})$ using the control
\[
u_{k+1} = \arg\min_{u\in U(i_{k+1})}\sum_{j=1}^n p_{i_{k+1}j}(u)\big( g(i_{k+1}, u, j) + \alpha\,\phi(j)'r_{k+1} \big).
\]
Because the theoretical convergence guarantees of LSPE apply only to the nonoptimistic version, it may be essential to experiment with various values of the stepsize $\gamma$ [this is true even if $G_k$ is chosen according to Eq. (6.72), for which $\gamma = 1$ guarantees convergence in the nonoptimistic version]. There is also a similar optimistic version of TD(0).
To improve the reliability of the optimistic LSTD method it seems necessary to turn it into an iterative method, which then brings it closer to LSPE. In particular, an iterative version of the regression-based LSTD method (6.58) is given by Eq. (6.82), and is the special case of LSPE corresponding to the special choice of the scaling matrix $G_k$ of Eq. (6.81).
Generally, in optimistic LSTD and LSPE, a substantial number of samples may need to be collected with the same policy before switching policies, in order to reduce the variance of $C_k$ and $d_k$. As an alternative, one may consider building up $C_k$ and $d_k$ as weighted averages, using samples from several past policies, while giving larger weight to the samples of the current policy. One may argue that mixing samples from several past policies may have a beneficial exploration effect. Still, however, similar to other versions of policy iteration, to enhance exploration, one may occasionally introduce randomly transitions other than the ones dictated by the current policy (cf. the discussion of Section 6.1.2). The complexities introduced by these variations are not fully understood at present. For experimental investigations of optimistic policy iteration, see Bertsekas and Ioffe [BeI96], Jung and Polani [JuP07], Busoniu et al. [BED09], and Thiery and Scherrer [ThS10a].
6.3.6 Multistep Simulation-Based Methods
A useful approach in approximate DP is to replace Bellman's equation with an equivalent equation that reflects control over multiple successive stages. This amounts to replacing $T$ with a multistep version that has the same fixed points; for example, $T^\ell$ with $\ell > 1$, or $T^{(\lambda)}$ given by
\[
T^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^\infty \lambda^\ell\,T^{\ell+1},
\]
where $\lambda \in (0, 1)$. We will focus on the $\lambda$-weighted multistep Bellman equation
\[
J = T^{(\lambda)}J.
\]
By noting that
\[
T^2J = g + \alpha P(TJ) = g + \alpha P(g + \alpha PJ) = (I + \alpha P)g + \alpha^2P^2J,
\]
\[
T^3J = g + \alpha P(T^2J) = g + \alpha P\big((I + \alpha P)g + \alpha^2P^2J\big) = (I + \alpha P + \alpha^2P^2)g + \alpha^3P^3J,
\]
etc., this equation can be written as
\[
J = T^{(\lambda)}J = g^{(\lambda)} + \alpha P^{(\lambda)}J, \tag{6.83}
\]
with
\[
P^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\,P^{\ell+1}, \qquad g^{(\lambda)} = \sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\,P^\ell g = (I - \alpha\lambda P)^{-1}g. \tag{6.84}
\]
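As a quick numerical sanity check of Eq. (6.84), the following Python sketch computes $P^{(\lambda)}$ and $g^{(\lambda)}$ in closed form and verifies that $J_\mu$ is a fixed point of $T^{(\lambda)}$; the chain $P$, cost vector $g$, $\alpha$, and $\lambda$ are illustrative assumptions.

    import numpy as np

    # P^{(lambda)} = (1-lambda) * P * (I - alpha*lambda*P)^{-1} by summing the series,
    # and g^{(lambda)} = (I - alpha*lambda*P)^{-1} g, cf. Eq. (6.84).
    n, alpha, lam = 3, 0.9, 0.7
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])
    g = np.array([1.0, 0.5, 2.0])            # expected one-stage costs

    M = np.linalg.inv(np.eye(n) - alpha * lam * P)
    P_lam = (1 - lam) * P @ M                # P^{(lambda)}
    g_lam = M @ g                            # g^{(lambda)}

    # The fixed point of J = g^{(lambda)} + alpha*P^{(lambda)} J is J_mu = (I - alpha P)^{-1} g
    J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)
    print(np.allclose(J_mu, g_lam + alpha * P_lam @ J_mu))   # True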
We may then apply variants of the preceding simulation algorithms to find a fixed point of $T^{(\lambda)}$ in place of $T$. The corresponding projected equation takes the form
\[
C^{(\lambda)}r = d^{(\lambda)},
\]
where
\[
C^{(\lambda)} = \Phi'\Xi\big(I - \alpha P^{(\lambda)}\big)\Phi, \qquad d^{(\lambda)} = \Phi'\Xi g^{(\lambda)} \tag{6.85}
\]
[cf. Eq. (6.41)]. The motivation for replacing $T$ with $T^{(\lambda)}$ is that the modulus of contraction of $T^{(\lambda)}$ is smaller, resulting in a tighter error bound. This is shown in the following proposition.
Proposition 6.3.5: The mappings $T^{(\lambda)}$ and $\Pi T^{(\lambda)}$ are contractions of modulus
\[
\alpha_\lambda = \frac{\alpha(1-\lambda)}{1-\alpha\lambda}
\]
with respect to the weighted Euclidean norm $\|\cdot\|_\xi$, where $\xi$ is the steady-state probability vector of the Markov chain. Furthermore
\[
\|J_\mu - \Phi r_\lambda^*\|_\xi \le \frac{1}{\sqrt{1-\alpha_\lambda^2}}\,\|J_\mu - \Pi J_\mu\|_\xi, \tag{6.86}
\]
where $\Phi r_\lambda^*$ is the fixed point of $\Pi T^{(\lambda)}$.

Proof: Using Lemma 6.3.1, we have
\[
\|P^{(\lambda)}z\|_\xi \le (1-\lambda)\sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\,\|P^{\ell+1}z\|_\xi \le (1-\lambda)\sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\,\|z\|_\xi = \frac{1-\lambda}{1-\alpha\lambda}\,\|z\|_\xi.
\]
Since $T^{(\lambda)}$ is linear with associated matrix $\alpha P^{(\lambda)}$ [cf. Eq. (6.83)], it follows that $T^{(\lambda)}$ is a contraction with modulus $\alpha(1-\lambda)/(1-\alpha\lambda)$. The estimate (6.86) follows similar to the proof of Prop. 6.3.2. Q.E.D.

Note that $\alpha_\lambda$ decreases as $\lambda$ increases, and $\alpha_\lambda \to 0$ as $\lambda \to 1$. Furthermore, the error bound (6.86) becomes better as $\lambda$ increases. Indeed
from Eq. (6.86), it follows that as $\lambda \to 1$, the projected equation solution $\Phi r_\lambda^*$ converges to the best approximation $\Pi J_\mu$ of $J_\mu$ on $S$. This suggests that large values of $\lambda$ should be used. On the other hand, we will later argue that when simulation-based approximations are used, the effects of simulation noise become more pronounced as $\lambda$ increases. Furthermore, we should note that in the context of approximate policy iteration, the objective is not just to approximate well the cost of the current policy, but rather to use the approximate cost to obtain the next "improved" policy. We are ultimately interested in a "good" next policy, and there is no consistent experimental or theoretical evidence that this is achieved solely by good cost approximation of the current policy. Thus, in practice, some trial and error with the value of $\lambda$ may be useful.
Another interesting fact, which follows from $\lim_{\lambda\to1}\alpha_\lambda = 0$, is that given any norm, the mapping $T^{(\lambda)}$ is a contraction (with arbitrarily small modulus) with respect to that norm for $\lambda$ sufficiently close to 1. This is a consequence of the norm equivalence property in $\Re^n$ (any norm is bounded by a constant multiple of any other norm). As a result, for any weighted Euclidean norm of projection, $\Pi T^{(\lambda)}$ is a contraction provided $\lambda$ is sufficiently close to 1.
LSTD(λ), LSPE(λ), and TD(λ)
The simulation-based methods of the preceding subsections correspond to $\lambda = 0$, but can be extended to $\lambda > 0$. In particular, in a matrix inversion approach, the unique solution of the projected equation may be approximated by
\[
\big(C_k^{(\lambda)}\big)^{-1}d_k^{(\lambda)}, \tag{6.87}
\]
where $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$ are simulation-based approximations of $C^{(\lambda)}$ and $d^{(\lambda)}$, given by Eq. (6.85). This is the LSTD(λ) method. There is also a regression/regularization variant of this method along the lines described earlier [cf. Eq. (6.58)].
Similarly, we may consider the (scaled) LSPE(λ) iteration
\[
r_{k+1} = r_k - \gamma G_k\big( C_k^{(\lambda)}r_k - d_k^{(\lambda)} \big), \tag{6.88}
\]
where $\gamma$ is a stepsize and $G_k$ is a scaling matrix that converges to some $G$ such that $I - \gamma GC^{(\lambda)}$ is a contraction. One possibility is to choose $\gamma = 1$ and
\[
G_k = \Bigg( \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\phi(i_t)' \Bigg)^{-1},
\]
[cf. Eq. (6.72)]. Diagonal approximations to this matrix may also be used to
avoid the computational overhead of matrix inversion. Another possibility
is
\[
G_k = \Big( C_k^{(\lambda)\prime}\Sigma_k^{-1}C_k^{(\lambda)} + \beta I \Big)^{-1}C_k^{(\lambda)\prime}\Sigma_k^{-1}, \tag{6.89}
\]
where $\Sigma_k$ is some positive definite symmetric matrix, and $\beta$ is a positive scalar [cf. Eq. (6.74)]. For $\gamma = 1$, we obtain the iteration
\[
r_{k+1} = \Big( C_k^{(\lambda)\prime}\Sigma_k^{-1}C_k^{(\lambda)} + \beta I \Big)^{-1}\Big( C_k^{(\lambda)\prime}\Sigma_k^{-1}d_k^{(\lambda)} + \beta r_k \Big). \tag{6.90}
\]
This is an iterative version of the regression-based LSTD method [cf. Eq. (6.75)], for which convergence is assured provided $C_k^{(\lambda)} \to C^{(\lambda)}$, $d_k^{(\lambda)} \to d^{(\lambda)}$, and $\{\Sigma_k^{-1}\}$ is bounded.
Regarding the calculation of appropriate simulation-based approximations $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$, one possibility is the following extension of Eqs. (6.48)-(6.49):
\[
C_k^{(\lambda)} = \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\sum_{m=t}^k \alpha^{m-t}\lambda^{m-t}\big( \phi(i_m) - \alpha\phi(i_{m+1}) \big)', \tag{6.91}
\]
\[
d_k^{(\lambda)} = \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\sum_{m=t}^k \alpha^{m-t}\lambda^{m-t}\,g(i_m, i_{m+1}). \tag{6.92}
\]
It can be shown that indeed these are correct simulation-based approximations to $C^{(\lambda)}$ and $d^{(\lambda)}$ of Eq. (6.85). The verification is similar to the case $\lambda = 0$, by considering the approximation of the steady-state probabilities $\xi_i$ and transition probabilities $p_{ij}$ with the empirical frequencies $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ defined by Eq. (6.67).
For a sketch of the argument, we first verify that the rightmost expression in the definition (6.91) of $C_k^{(\lambda)}$ can be written as
\[
\sum_{m=t}^k \alpha^{m-t}\lambda^{m-t}\big( \phi(i_m) - \alpha\phi(i_{m+1}) \big)' = \phi(i_t)' - (1-\lambda)\sum_{m=t}^{k-1}\alpha^{m-t+1}\lambda^{m-t}\phi(i_{m+1})' - \alpha^{k-t+1}\lambda^{k-t}\phi(i_{k+1})',
\]
which by discarding the last term (it is negligible for $k \gg t$), yields
\[
\sum_{m=t}^k \alpha^{m-t}\lambda^{m-t}\big( \phi(i_m) - \alpha\phi(i_{m+1}) \big)' \approx \phi(i_t)' - (1-\lambda)\sum_{m=t}^{k-1}\alpha^{m-t+1}\lambda^{m-t}\phi(i_{m+1})'.
\]
Using this relation in the expression (6.91) for $C_k^{(\lambda)}$, we obtain
\[
C_k^{(\lambda)} \approx \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\Bigg( \phi(i_t) - (1-\lambda)\sum_{m=t}^{k-1}\alpha^{m-t+1}\lambda^{m-t}\phi(i_{m+1}) \Bigg)'.
\]
We now compare this expression with $C^{(\lambda)}$, which similar to Eq. (6.50), can be written as
\[
C^{(\lambda)} = \Phi'\Xi(I - \alpha P^{(\lambda)})\Phi = \sum_{i=1}^n \xi_i\,\phi(i)\Bigg( \phi(i) - \alpha\sum_{j=1}^n p_{ij}^{(\lambda)}\phi(j) \Bigg)',
\]
where $p_{ij}^{(\lambda)}$ are the components of the matrix $P^{(\lambda)}$. It can be seen (cf. the derivations of Section 6.3.3) that
\[
\frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\phi(i_t)' \to \sum_{i=1}^n \xi_i\,\phi(i)\phi(i)',
\]
while by using the formula
\[
p_{ij}^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\,p_{ij}^{(\ell+1)},
\]
with $p_{ij}^{(\ell+1)}$ being the $(i,j)$th component of $P^{\ell+1}$ [cf. Eq. (6.84)], it can be verified that
\[
\frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\Bigg( (1-\lambda)\sum_{m=t}^{k-1}\alpha^{m-t+1}\lambda^{m-t}\phi(i_{m+1}) \Bigg)' \to \sum_{i=1}^n \xi_i\,\phi(i)\Bigg( \alpha\sum_{j=1}^n p_{ij}^{(\lambda)}\phi(j) \Bigg)'.
\]
Thus, by comparing the preceding expressions, we see that $C_k^{(\lambda)} \to C^{(\lambda)}$ with probability 1. A full convergence analysis can be found in [NeB03] and also in [BeY09], [Yu10a,b], in a more general exploration-related context, to be discussed in Section 6.3.7 and also in Section 6.8.
We may also streamline the calculation of $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$ by introducing the vector
\[
z_t = \sum_{m=0}^t (\alpha\lambda)^{t-m}\phi(i_m), \tag{6.93}
\]
which is often called the eligibility vector; it is a weighted sum of the present and past feature vectors $\phi(i_m)$ obtained from the simulation [discounted by $(\alpha\lambda)^{t-m}$]. Then, by straightforward calculation, we may verify that
\[
C_k^{(\lambda)} = \frac{1}{k+1}\sum_{t=0}^k z_t\big( \phi(i_t) - \alpha\phi(i_{t+1}) \big)', \tag{6.94}
\]
\[
d_k^{(\lambda)} = \frac{1}{k+1}\sum_{t=0}^k z_t\,g(i_t, i_{t+1}). \tag{6.95}
\]
Note that $z_k$, $C_k^{(\lambda)}$, $d_k^{(\lambda)}$ can be conveniently updated by means of recursive formulas, as in the case $\lambda = 0$. In particular, we have
\[
z_k = \alpha\lambda\,z_{k-1} + \phi(i_k),
\]
\[
C_k^{(\lambda)} = (1-\delta_k)C_{k-1}^{(\lambda)} + \delta_k z_k\big( \phi(i_k) - \alpha\phi(i_{k+1}) \big)',
\]
\[
d_k^{(\lambda)} = (1-\delta_k)d_{k-1}^{(\lambda)} + \delta_k z_k\,g(i_k, i_{k+1}),
\]
with the initial conditions $z_{-1} = 0$, $C_{-1} = 0$, $d_{-1} = 0$, and
\[
\delta_k = \frac{1}{k+1}, \qquad k = 0, 1, \ldots.
\]
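The following is a minimal Python sketch of these recursions, followed by the LSTD(λ) solve (6.87); the chain, costs, features, $\alpha$, and $\lambda$ are illustrative assumptions.

    import numpy as np

    # Recursive computation of the eligibility vector z_k (6.93) and of the
    # estimates C_k^{(lambda)} (6.94) and d_k^{(lambda)} (6.95), then LSTD(lambda).
    np.random.seed(2)
    n, s, alpha, lam = 3, 2, 0.9, 0.6
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])
    g = np.array([[1.0, 2.0, 0.0],
                  [0.5, 1.0, 1.5],
                  [2.0, 0.0, 1.0]])
    Phi = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [1.0, 3.0]])

    z = np.zeros(s)                 # eligibility vector
    C = np.zeros((s, s))            # C_k^{(lambda)}
    d = np.zeros(s)                 # d_k^{(lambda)}
    i = 0
    for k in range(100000):
        j = np.random.choice(n, p=P[i])
        z = alpha * lam * z + Phi[i]
        delta = 1.0 / (k + 1)
        C = (1 - delta) * C + delta * np.outer(z, Phi[i] - alpha * Phi[j])
        d = (1 - delta) * d + delta * z * g[i, j]
        i = j
    print("LSTD(lambda) estimate r =", np.linalg.solve(C, d))   # Eq. (6.87)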
Let us also note that by using the above formulas for $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$, the scaled LSPE(λ) iteration (6.88) can also be written as
\[
r_{k+1} = r_k - \frac{\gamma}{k+1}\,G_k\sum_{t=0}^k z_t\,q_{k,t}, \tag{6.96}
\]
where $q_{k,t}$ is the temporal difference
\[
q_{k,t} = \phi(i_t)'r_k - \alpha\phi(i_{t+1})'r_k - g(i_t, i_{t+1}) \tag{6.97}
\]
[cf. Eqs. (6.54) and (6.70)].
The TD(λ) algorithm is essentially TD(0) applied to the multistep projected equation $C^{(\lambda)}r = d^{(\lambda)}$. It takes the form
\[
r_{k+1} = r_k - \gamma_k\,z_k\,q_{k,k}, \tag{6.98}
\]
where $\gamma_k$ is a stepsize parameter. When compared to the scaled LSPE(λ) method (6.96), we see that TD(λ) uses $G_k = I$ and only the latest temporal difference $q_{k,k}$. This amounts to approximating $C^{(\lambda)}$ and $d^{(\lambda)}$ by a single sample, instead of $k + 1$ samples. Note that as $\lambda \to 1$, $z_k$ approaches $\sum_{t=0}^k \alpha^{k-t}\phi(i_t)$ [cf. Eq. (6.93)], and TD(λ) approaches the TD(1) method given earlier in Section 6.2 [cf. Eq. (6.32)].
Least Squares Implementation of LSPE(λ)
Let us now discuss an alternative development of the (unscaled) LSPE(λ) method, which is based on the PVI(λ) method and parallels the implementation (6.63) for LSPE(0). We first obtain an alternative formula for $T^{(\lambda)}$, and to this end we view $T^{t+1}J$ as the vector of costs over a horizon of $(t+1)$ stages with the terminal cost function being $J$, and write
\[
T^{t+1}J = \alpha^{t+1}P^{t+1}J + \sum_{k=0}^t \alpha^k P^k g. \tag{6.99}
\]
As a result the mapping $T^{(\lambda)} = (1-\lambda)\sum_{t=0}^\infty \lambda^t T^{t+1}$ can be expressed as
\[
\big(T^{(\lambda)}J\big)(i) = \sum_{t=0}^\infty (1-\lambda)\lambda^t\,E\Bigg\{ \alpha^{t+1}J(i_{t+1}) + \sum_{k=0}^t \alpha^k g(i_k, i_{k+1})\ \Bigg|\ i_0 = i \Bigg\}, \tag{6.100}
\]
which can be written as
\[
\big(T^{(\lambda)}J\big)(i) = J(i) + (1-\lambda)\sum_{t=0}^\infty \lambda^t\sum_{k=0}^t \alpha^k\,E\big\{ g(i_k, i_{k+1}) + \alpha J(i_{k+1}) - J(i_k) \mid i_0 = i \big\}
\]
\[
= J(i) + (1-\lambda)\sum_{k=0}^\infty \Bigg( \sum_{t=k}^\infty \lambda^t \Bigg)\alpha^k\,E\big\{ g(i_k, i_{k+1}) + \alpha J(i_{k+1}) - J(i_k) \mid i_0 = i \big\},
\]
and finally,
\[
\big(T^{(\lambda)}J\big)(i) = J(i) + \sum_{t=0}^\infty (\alpha\lambda)^t\,E\big\{ g(i_t, i_{t+1}) + \alpha J(i_{t+1}) - J(i_t) \mid i_0 = i \big\}.
\]
Using this equation, we can write the PVI(λ) iteration
\[
\Phi r_{k+1} = \Pi T^{(\lambda)}(\Phi r_k)
\]
as
\[
r_{k+1} = \arg\min_{r\in\Re^s}\sum_{i=1}^n \xi_i\Bigg( \phi(i)'r - \phi(i)'r_k - \sum_{t=0}^\infty (\alpha\lambda)^t\,E\big\{ g(i_t, i_{t+1}) + \alpha\phi(i_{t+1})'r_k - \phi(i_t)'r_k \mid i_0 = i \big\} \Bigg)^2,
\]
and by introducing the temporal differences
\[
d_k(i_t, i_{t+1}) = g(i_t, i_{t+1}) + \alpha\phi(i_{t+1})'r_k - \phi(i_t)'r_k,
\]
we finally obtain PVI(λ) in the form
\[
r_{k+1} = \arg\min_{r\in\Re^s}\sum_{i=1}^n \xi_i\Bigg( \phi(i)'r - \phi(i)'r_k - \sum_{t=0}^\infty (\alpha\lambda)^t\,E\big\{ d_k(i_t, i_{t+1}) \mid i_0 = i \big\} \Bigg)^2. \tag{6.101}
\]
The LSPE(λ) method is a simulation-based approximation to the above PVI(λ) iteration. It has the form
\[
r_{k+1} = \arg\min_{r\in\Re^s}\sum_{t=0}^k\Bigg( \phi(i_t)'r - \phi(i_t)'r_k - \sum_{m=t}^k (\alpha\lambda)^{m-t}\,d_k(i_m, i_{m+1}) \Bigg)^2, \tag{6.102}
\]
where $(i_0, i_1, \ldots)$ is an infinitely long trajectory generated by simulation. The justification is that the solution of the least squares problem in the PVI(λ) iteration (6.101) is approximately equal to the solution of the least squares problem in the LSPE(λ) iteration (6.102). Similar to the case $\lambda = 0$ [cf. Eqs. (6.62) and (6.63)], the approximation is due to:
(a) The substitution of the steady-state probabilities $\xi_i$ and transition probabilities $p_{ij}$ with the empirical frequencies $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ defined by Eq. (6.67).
(b) The approximation of the infinite discounted sum of temporal differences in Eq. (6.101) with the finite discounted sum in Eq. (6.102), which also uses an approximation of the conditional probabilities of the transitions $(i_t, i_{t+1})$ with corresponding empirical frequencies.
Since as $k \to \infty$, the empirical frequencies converge to the true probabilities and the finite discounted sums converge to the infinite discounted sums, it follows that PVI(λ) and LSPE(λ) asymptotically coincide.
Exploration-Enhanced LSPE(λ), LSTD(λ), and TD(λ)
We next develop an alternative least squares implementation of LSPE(λ). It uses multiple simulation trajectories and the initial state of each trajectory may be chosen essentially as desired, thereby allowing flexibility to generate a richer mixture of state visits. In particular, we generate $t$ simulated trajectories. The states of a trajectory are generated according to the transition probabilities $p_{ij}$ of the policy under evaluation, the transition cost is discounted by an additional factor $\alpha$ with each transition, and following each transition to a state $j$, the trajectory is terminated with probability $1 - \lambda$ and with an extra cost $\alpha\phi(j)'r_k$, where $r_k$ is the current estimate of the cost vector of the policy under evaluation. Once a trajectory is terminated, an initial state for the next trajectory is chosen according to a fixed probability distribution $\zeta_0 = \big(\zeta_0(1), \ldots, \zeta_0(n)\big)$, where
\[
\zeta_0(i) = P(i_0 = i) > 0, \qquad i = 1, \ldots, n,
\]
and the process is repeated. The details are as follows.
Let the $m$th trajectory have the form $(i_{0,m}, i_{1,m}, \ldots, i_{N_m,m})$, where $i_{0,m}$ is the initial state, and $i_{N_m,m}$ is the state at which the trajectory is completed (the last state prior to termination). For each state $i_{\ell,m}$, $\ell = 0, \ldots, N_m - 1$, of the $m$th trajectory, the simulated cost is
\[
c_{\ell,m}(r_k) = \alpha^{N_m-\ell}\phi(i_{N_m,m})'r_k + \sum_{q=\ell}^{N_m-1}\alpha^{q-\ell}\,g(i_{q,m}, i_{q+1,m}). \tag{6.103}
\]
Once the costs $c_{\ell,m}(r_k)$ are computed for all states $i_{\ell,m}$ of the $m$th trajectory and all trajectories $m = 1, \ldots, t$, the vector $r_{k+1}$ is obtained by a least squares fit of these costs:
\[
r_{k+1} = \arg\min_{r\in\Re^s}\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\big( \phi(i_{\ell,m})'r - c_{\ell,m}(r_k) \big)^2, \tag{6.104}
\]
similar to Eq. (6.102).
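The following Python sketch illustrates one update of the trajectory-based scheme (6.103)-(6.104); the chain, costs, features, $\alpha$, $\lambda$, restart distribution, and trajectory count are illustrative assumptions.

    import numpy as np

    # One iteration of (6.103)-(6.104): trajectories terminate with probability
    # 1 - lambda after each transition, restarts are drawn from zeta0, and r_{k+1}
    # is the least squares fit of the sampled costs.
    def lspe_lambda_update(r_k, P, g, Phi, alpha, lam, zeta0, num_traj, rng):
        n, s = Phi.shape
        rows, targets = [], []
        for _ in range(num_traj):
            traj = [rng.choice(n, p=zeta0)]        # initial state of the trajectory
            while True:
                traj.append(rng.choice(n, p=P[traj[-1]]))
                if rng.random() > lam:             # terminate with probability 1 - lambda
                    break
            N = len(traj) - 1
            for ell in range(N):                   # cost samples c_{ell,m}(r_k), Eq. (6.103)
                c = alpha ** (N - ell) * Phi[traj[N]].dot(r_k)
                c += sum(alpha ** (q - ell) * g[traj[q], traj[q + 1]]
                         for q in range(ell, N))
                rows.append(Phi[traj[ell]])
                targets.append(c)
        A = np.array(rows)
        return np.linalg.lstsq(A, np.array(targets), rcond=None)[0]   # Eq. (6.104)

    rng = np.random.default_rng(3)
    P = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
    g = np.array([[1.0, 2.0, 0.0], [0.5, 1.0, 1.5], [2.0, 0.0, 1.0]])
    Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    r = np.zeros(2)
    for k in range(30):
        r = lspe_lambda_update(r, P, g, Phi, 0.9, 0.6, np.ones(3) / 3,
                               num_traj=500, rng=rng)
    print("exploration-enhanced LSPE(lambda) estimate r =", r)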
We will now show that in the limit, as $t \to \infty$, the vector $r_{k+1}$ of Eq. (6.104) satisfies
\[
\Phi r_{k+1} = \bar\Pi T^{(\lambda)}(\Phi r_k), \tag{6.105}
\]
where $\bar\Pi$ denotes projection with respect to the weighted Euclidean norm with weight vector $\zeta = \big(\zeta(1), \ldots, \zeta(n)\big)$, where
\[
\zeta(i) = \frac{\bar\zeta(i)}{\sum_{j=1}^n \bar\zeta(j)}, \qquad i = 1, \ldots, n,
\]
and $\bar\zeta(i) = \sum_{\ell=0}^\infty \lambda^\ell\zeta_\ell(i)$, with $\zeta_\ell(i)$ being the probability of the state being $i$ after $\ell$ steps of a randomly chosen simulation trajectory. Note that $\zeta(i)$ is the long-term occupancy probability of state $i$ during the simulation process.
Indeed, let us view $T^{\ell+1}J$ as the vector of total discounted costs over a horizon of $(\ell+1)$ stages with the terminal cost function being $J$, and write
\[
T^{\ell+1}J = \alpha^{\ell+1}P^{\ell+1}J + \sum_{q=0}^\ell \alpha^q P^q g,
\]
where $P$ and $g$ are the transition probability matrix and cost vector, respectively, under the current policy. As a result the vector $T^{(\lambda)}J = (1-\lambda)\sum_{\ell=0}^\infty\lambda^\ell T^{\ell+1}J$ can be expressed as
\[
\big(T^{(\lambda)}J\big)(i) = \sum_{\ell=0}^\infty (1-\lambda)\lambda^\ell\,E\Bigg\{ \alpha^{\ell+1}J(i_{\ell+1}) + \sum_{q=0}^\ell \alpha^q g(i_q, i_{q+1})\ \Bigg|\ i_0 = i \Bigg\}. \tag{6.106}
\]
Thus $\big(T^{(\lambda)}J\big)(i)$ may be viewed as the expected value of the $(\ell+1)$-stages cost of the policy under evaluation starting at state $i$, with the number of stages being random and geometrically distributed with parameter $\lambda$ [the probability of $\ell+1$ transitions is $(1-\lambda)\lambda^\ell$, $\ell = 0, 1, \ldots$]. It follows that the cost samples $c_{\ell,m}(r_k)$ of Eq. (6.103), produced by the simulation process described earlier, can be used to estimate $\big(T^{(\lambda)}(\Phi r_k)\big)(i)$ for all $i$ by Monte Carlo averaging. The estimation formula is
\[
\hat C_t(i) = \frac{1}{\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)}\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)\,c_{\ell,m}(r_k), \tag{6.107}
\]
where for any event $E$, we denote by $\delta(E)$ the indicator function of $E$, and we have
\[
\big(T^{(\lambda)}(\Phi r_k)\big)(i) = \lim_{t\to\infty}\hat C_t(i), \qquad i = 1, \ldots, n
\]
(see also the discussion on the consistency of Monte Carlo simulation for policy evaluation in Section 6.2, Exercise 6.2, and [BeT96], Section 5.2).
Let us now compare iteration (6.105) with the simulation-based implementation (6.104). Using the definition of projection, Eq. (6.105) can be written as
\[
r_{k+1} = \arg\min_{r\in\Re^s}\sum_{i=1}^n \zeta(i)\Big( \phi(i)'r - \big(T^{(\lambda)}(\Phi r_k)\big)(i) \Big)^2,
\]
or equivalently
\[
r_{k+1} = \Bigg( \sum_{i=1}^n \zeta(i)\phi(i)\phi(i)' \Bigg)^{-1}\sum_{i=1}^n \zeta(i)\phi(i)\big(T^{(\lambda)}(\Phi r_k)\big)(i). \tag{6.108}
\]
Let $\hat\zeta(i)$ be the empirical relative frequency of state $i$ during the simulation, given by
\[
\hat\zeta(i) = \frac{1}{N_1+\cdots+N_t}\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i). \tag{6.109}
\]
Then the simulation-based estimate (6.104) can be written as
\[
r_{k+1} = \Bigg( \sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\phi(i_{\ell,m})\phi(i_{\ell,m})' \Bigg)^{-1}\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\phi(i_{\ell,m})\,c_{\ell,m}(r_k)
\]
\[
= \Bigg( \sum_{i=1}^n\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)\,\phi(i)\phi(i)' \Bigg)^{-1}\sum_{i=1}^n\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)\,\phi(i)\,c_{\ell,m}(r_k)
\]
\[
= \Bigg( \sum_{i=1}^n\hat\zeta(i)\phi(i)\phi(i)' \Bigg)^{-1}\sum_{i=1}^n\frac{1}{N_1+\cdots+N_t}\,\phi(i)\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)\,c_{\ell,m}(r_k)
\]
\[
= \Bigg( \sum_{i=1}^n\hat\zeta(i)\phi(i)\phi(i)' \Bigg)^{-1}\sum_{i=1}^n\frac{\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)}{N_1+\cdots+N_t}\,\phi(i)\,\frac{\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)\,c_{\ell,m}(r_k)}{\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\delta(i_{\ell,m}=i)},
\]
and finally, using Eqs. (6.107) and (6.109),
\[
r_{k+1} = \Bigg( \sum_{i=1}^n\hat\zeta(i)\phi(i)\phi(i)' \Bigg)^{-1}\sum_{i=1}^n\hat\zeta(i)\phi(i)\hat C_t(i). \tag{6.110}
\]
Since $\big(T^{(\lambda)}(\Phi r_k)\big)(i) = \lim_{t\to\infty}\hat C_t(i)$ and $\zeta(i) = \lim_{t\to\infty}\hat\zeta(i)$, we see that the iteration (6.108) and the simulation-based implementation (6.110) asymptotically coincide.
An important fact is that the implementation just described deals effectively with the issue of exploration. Since each simulation trajectory is completed at each transition with the potentially large probability $1-\lambda$, a restart with a new initial state $i_0$ is frequent and the length of each of the simulated trajectories is relatively small. Thus the restart mechanism can be used as a natural form of exploration, by choosing appropriately the restart distribution $\zeta_0$, so that $\zeta_0(i)$ reflects a substantial weight for all states $i$.
An interesting special case is when $\lambda = 0$, in which case the simulated trajectories consist of a single transition. Thus there is a restart at every transition, which means that the simulation samples are from states that are generated independently according to the restart distribution $\zeta_0$.
We can also develop similarly a least squares exploration-enhanced implementation of LSTD(λ). We use the same simulation procedure, and in analogy to Eq. (6.103) we define
\[
c_{\ell,m}(r) = \alpha^{N_m-\ell}\phi(i_{N_m,m})'r + \sum_{q=\ell}^{N_m-1}\alpha^{q-\ell}\,g(i_{q,m}, i_{q+1,m}).
\]
The LSTD(λ) approximation $\hat r$ to the projected equation
\[
\Phi r = \bar\Pi T^{(\lambda)}(\Phi r)
\]
[cf. Eq. (6.105)] is determined from the fixed point equation
\[
\hat r = \arg\min_{r\in\Re^s}\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\big( \phi(i_{\ell,m})'r - c_{\ell,m}(\hat r) \big)^2. \tag{6.111}
\]
By writing the optimality condition
\[
\sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\phi(i_{\ell,m})\big( \phi(i_{\ell,m})'\hat r - c_{\ell,m}(\hat r) \big) = 0
\]
for the above least squares minimization and solving for $\hat r$, we obtain
\[
\hat r = \hat C^{-1}\hat d, \tag{6.112}
\]
where
\[
\hat C = \sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\phi(i_{\ell,m})\big( \phi(i_{\ell,m}) - \alpha^{N_m-\ell}\phi(i_{N_m,m}) \big)', \tag{6.113}
\]
and
\[
\hat d = \sum_{m=1}^t\sum_{\ell=0}^{N_m-1}\phi(i_{\ell,m})\sum_{q=\ell}^{N_m-1}\alpha^{q-\ell}\,g(i_{q,m}, i_{q+1,m}). \tag{6.114}
\]
For a large number of trajectories $t$, the methods (6.104) and (6.111) [or equivalently (6.112)-(6.114)] yield similar results, particularly when $\lambda$ is close to 1. However, the method (6.104) has an iterative character ($r_{k+1}$ depends on $r_k$), so it is reasonable to expect that it is less susceptible to simulation noise in an optimistic PI setting where the number of samples per policy is low.
Similarly, to obtain an exploration-enhanced TD(λ), we simply solve approximately the least squares problem in Eq. (6.104) by iterating, perhaps multiple times, with an incremental gradient method. The details of this type of algorithm are straightforward (see Section 6.2). The method does not involve matrix inversion like the exploration-enhanced implementations (6.104) and (6.112)-(6.114) of LSPE(λ) and LSTD(λ), respectively, but is much slower and less reliable.
Feature Scaling and its Effect on LSTD(λ), LSPE(λ), and TD(λ)
Let us now discuss how the representation of the approximation subspace $S$ affects the results produced by LSTD(λ), LSPE(λ), and TD(λ). In particular, suppose that instead of $S$ being represented as
\[
S = \{\Phi r \mid r \in \Re^s\},
\]
it is equivalently represented as
\[
S = \{\Psi v \mid v \in \Re^s\},
\]
where
\[
\Psi = \Phi B,
\]
with $B$ being an invertible $s \times s$ matrix. Thus $S$ is represented as the span of a different set of basis functions, and any vector $\Phi r \in S$ can be written as $\Psi v$, where the weight vector $v$ is equal to $B^{-1}r$. Moreover, each row $\psi(i)'$ of $\Psi$, the feature vector of state $i$ in the representation based on $\Psi$, is equal to $\phi(i)'B$, the linearly transformed feature vector of $i$ in the representation based on $\Phi$.
Suppose that we generate a trajectory $(i_0, i_1, \ldots)$ according to the simulation process of Section 6.3.3, and we calculate the iterates of LSTD(λ), LSPE(λ), and TD(λ) using the two different representations of $S$, based on $\Phi$ and $\Psi$. Let $C^{(\lambda)}_{k,\Phi}$ and $C^{(\lambda)}_{k,\Psi}$ be the corresponding matrices generated by Eq. (6.94), and let $d^{(\lambda)}_{k,\Phi}$ and $d^{(\lambda)}_{k,\Psi}$ be the corresponding vectors generated by Eq. (6.95). Let also $z_{t,\Phi}$ and $z_{t,\Psi}$ be the corresponding eligibility vectors generated by Eq. (6.93). Then, since $\psi(i_m) = B'\phi(i_m)$, we have
\[
z_{t,\Psi} = B'z_{t,\Phi},
\]
and from Eqs. (6.94) and (6.95),
\[
C^{(\lambda)}_{k,\Psi} = B'C^{(\lambda)}_{k,\Phi}B, \qquad d^{(\lambda)}_{k,\Psi} = B'd^{(\lambda)}_{k,\Phi}.
\]
We now wish to compare the high dimensional iterates $\Phi r_k$ and $\Psi v_k$ produced by the different methods. Based on the preceding equation, we claim that LSTD(λ) is scale-free in the sense that $\Phi r_k = \Psi v_k$ for all $k$. Indeed, in the case of LSTD(λ) we have [cf. Eq. (6.87)]
\[
\Phi r_k = \Phi\big(C^{(\lambda)}_{k,\Phi}\big)^{-1}d^{(\lambda)}_{k,\Phi} = \Phi B\big(B'C^{(\lambda)}_{k,\Phi}B\big)^{-1}B'd^{(\lambda)}_{k,\Phi} = \Psi\big(C^{(\lambda)}_{k,\Psi}\big)^{-1}d^{(\lambda)}_{k,\Psi} = \Psi v_k.
\]
We also claim that LSPE(λ) with
\[
G_k = \Bigg( \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\phi(i_t)' \Bigg)^{-1}
\]
[cf. Eq. (6.72)] is scale-free in the sense that $\Phi r_k = \Psi v_k$ for all $k$. This follows from Eq. (6.96) using a calculation similar to the one for LSTD(λ), but it also follows intuitively from the fact that LSPE(λ), with $G_k$ as given above, is a simulation-based implementation of the PVI(λ) iteration $J_{k+1} = \Pi T^{(\lambda)}J_k$, which involves the projection operator $\Pi$ that is scale-free (does not depend on the representation of $S$).
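The scale-freeness of LSTD(λ) can be checked numerically; the following Python sketch runs LSTD(λ) with the two representations $\Phi$ and $\Psi = \Phi B$ on the same simulated trajectory and compares $\Phi r_k$ with $\Psi v_k$. The trajectory data and the matrix $B$ are illustrative assumptions.

    import numpy as np

    # Numerical check that LSTD(lambda) produces the same high dimensional vector
    # under the change of basis Psi = Phi B.
    np.random.seed(4)
    n, s, alpha, lam = 3, 2, 0.9, 0.5
    P = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
    g = np.array([[1.0, 2.0, 0.0], [0.5, 1.0, 1.5], [2.0, 0.0, 1.0]])
    Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    B = np.array([[2.0, 1.0], [0.0, 3.0]])        # invertible basis change
    Psi = Phi @ B

    def lstd(features, num_samples, seed):
        rng = np.random.default_rng(seed)
        z = np.zeros(s); C = np.zeros((s, s)); d = np.zeros(s); i = 0
        for k in range(num_samples):
            j = rng.choice(n, p=P[i])
            z = alpha * lam * z + features[i]
            C += np.outer(z, features[i] - alpha * features[j])
            d += z * g[i, j]
            i = j
        return np.linalg.solve(C, d)

    r = lstd(Phi, 50000, seed=7)                  # same trajectory for both runs
    v = lstd(Psi, 50000, seed=7)
    print(np.allclose(Phi @ r, Psi @ v))          # True: LSTD(lambda) is scale-free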
We finally note that the TD(λ) iteration (6.98) is not scale-free unless $B$ is an orthogonal matrix ($BB' = I$). This can be verified with a direct calculation using the iteration (6.98) for the case of the two representations of $S$ based on $\Phi$ and $\Psi$. In particular, let $r_k$ be generated by TD(λ) based on $\Phi$,
\[
r_{k+1} = r_k - \gamma_k z_{k,\Phi}\big( \phi(i_k)'r_k - \alpha\phi(i_{k+1})'r_k - g(i_k, i_{k+1}) \big),
\]
and let $v_k$ be generated by TD(λ) based on $\Psi$,
\[
v_{k+1} = v_k - \gamma_k z_{k,\Psi}\big( \psi(i_k)'v_k - \alpha\psi(i_{k+1})'v_k - g(i_k, i_{k+1}) \big)
\]
[cf. Eqs. (6.97), (6.98)]. Then, we generally have $\Phi r_k \ne \Psi v_k$, since $\Phi r_k = \Psi\hat v_k$ with $\hat v_k = B^{-1}r_k$, and generally $\hat v_k \ne v_k$. In particular, the vector $\hat v_k = B^{-1}r_k$ is generated by the iteration
\[
\hat v_{k+1} = \hat v_k - \gamma_k (B'B)^{-1}z_{k,\Psi}\big( \psi(i_k)'\hat v_k - \alpha\psi(i_{k+1})'\hat v_k - g(i_k, i_{k+1}) \big),
\]
which is different from the iteration that generates $v_k$, unless $B'B = I$, i.e., unless $B$ is orthogonal.
This analysis also indicates that the appropriate value of the stepsize $\gamma_k$ in TD(λ) strongly depends on the choice of basis functions to represent $S$, and points to a generic weakness of the method.
6.3.7 Policy Iteration Issues - Exploration
We have discussed so far policy evaluation methods based on the projected equation. We will now address, in this and the next subsection, some of the difficulties associated with these methods, when embedded within policy iteration. One difficulty has to do with the issue of exploration: for a variety of reasons it is important to generate trajectories according to the steady-state distribution $\xi$ associated with the given policy $\mu$ (one reason is the need to concentrate on "important" states that are likely to occur under a near-optimal policy, and another is the desirability to maintain the contraction property of $\Pi T$). On the other hand, this biases the simulation by underrepresenting states that are unlikely to occur under $\mu$, causing potentially serious errors in the calculation of a new policy via policy improvement.
Another difficulty is that Assumption 6.3.1 (the irreducibility of the transition matrix $P$ of the policy being evaluated) may be hard or impossible to guarantee, in which case the methods break down, either because of the presence of transient states (in which case the components of $\xi$ corresponding to transient states are 0, and these states are not represented in the constructed approximation), or because of multiple recurrent classes (in which case some states will never be generated during the simulation, and again will not be represented in the constructed approximation).
We noted earlier one possibility to introduce a natural form of exploration in a least squares implementation of LSPE(λ) and LSTD(λ). We will now discuss another popular approach to address the exploration difficulty, which is often used in conjunction with LSTD. This is to modify the transition matrix $P$ of the given policy $\mu$ by occasionally generating transitions other than the ones dictated by $\mu$. If the modified transition probability matrix is irreducible, we simultaneously address the difficulty with multiple recurrent classes and transient states as well. Mathematically, in such a scheme we generate an infinitely long trajectory $(i_0, i_1, \ldots)$ according to an irreducible transition probability matrix
\[
\bar P = (I - B)P + BQ, \tag{6.115}
\]
where $B$ is a diagonal matrix with diagonal components $\beta_i \in [0, 1]$ and $Q$ is another transition probability matrix. Thus, at state $i$, the next state is generated with probability $1 - \beta_i$ according to transition probabilities $p_{ij}$, and with probability $\beta_i$ according to transition probabilities $q_{ij}$ [here pairs $(i, j)$ with $q_{ij} > 0$ need not correspond to physically plausible transitions]. We refer to $\beta_i$ as the exploration probability at state $i$.
Unfortunately, using $\bar P$ in place of $P$ for simulation, with no other modification in the TD algorithms, creates a bias that tends to degrade the quality of policy evaluation, because it directs the algorithms towards approximating the fixed point of the mapping $\bar T^{(\lambda)}$, given by
\[
\bar T^{(\lambda)}(J) = \bar g^{(\lambda)} + \alpha\bar P^{(\lambda)}J,
\]
where
\[
\bar T^{(\lambda)}(J) = (1-\lambda)\sum_{t=0}^\infty \lambda^t\,\bar T^{t+1}(J),
\]
with
\[
\bar T(J) = \bar g + \alpha\bar PJ
\]
[cf. Eq. (6.84)]. This is the cost of a different policy, a fictitious exploration-enhanced policy that has a cost vector $\bar g$ with components
\[
\bar g_i = \sum_{j=1}^n \bar p_{ij}\,g(i, j), \qquad i = 1, \ldots, n,
\]
In the literature, e.g., [SuB98], the policy being evaluated is sometimes called the target policy to distinguish from a policy modified for exploration like $\bar P$, which is called the behavior policy. Also, methods that use a behavior policy are called off-policy methods, while methods that do not are called on-policy methods. Note, however, that $\bar P$ need not correspond to an admissible policy, and indeed there may not exist a suitable admissible policy that can induce sufficient exploration.
and a transition probability matrix $\bar P$ in place of $P$. In particular, when the simulated trajectory is generated according to $\bar P$, the LSTD(λ), LSPE(λ), and TD(λ) algorithms yield the unique solution $\bar r_\lambda$ of the equation
\[
\Phi r = \bar\Pi\bar T^{(\lambda)}(\Phi r), \tag{6.116}
\]
where $\bar\Pi$ denotes projection on the approximation subspace with respect to $\|\cdot\|_{\bar\xi}$, where $\bar\xi$ is the invariant distribution corresponding to $\bar P$.
We will discuss in this section some schemes that allow the approximation of the solution of the projected equation
\[
\Phi r = \bar\Pi T^{(\lambda)}(\Phi r), \tag{6.117}
\]
where $\bar\Pi$ is projection with respect to the norm $\|\cdot\|_{\bar\xi}$, corresponding to the steady-state distribution $\bar\xi$ of $\bar P$. Note the difference between equations (6.116) and (6.117): the first involves $\bar T$ but the second involves $T$, so it aims to approximate the desired fixed point of $T$, rather than a fixed point of $\bar T$. Thus, the following schemes allow exploration, but without the degradation of approximation quality resulting from the use of $\bar T$ in place of $T$.
Exploration Using Extra Transitions
The first scheme applies only to the case where $\lambda = 0$. Then a vector $r^*$ solves the exploration-enhanced projected equation $\Phi r = \bar\Pi T(\Phi r)$ if and only if it satisfies the orthogonality condition
\[
\Phi'\bar\Xi(\Phi r^* - \alpha P\Phi r^* - g) = 0, \tag{6.118}
\]
where $\bar\xi$ is the steady-state distribution of $\bar P$ and $\bar\Xi$ is the diagonal matrix with $\bar\xi$ along the diagonal [cf. Eq. (6.39)]. This condition can be written in matrix form as
\[
Cr^* = d,
\]
where
\[
C = \Phi'\bar\Xi(I - \alpha P)\Phi, \qquad d = \Phi'\bar\Xi g. \tag{6.119}
\]
These equations should be compared with the equations for the case where $\bar P = P$ [cf. Eqs. (6.40)-(6.41)]: the only difference is that the distribution matrix $\Xi$ is replaced by the exploration-enhanced distribution matrix $\bar\Xi$.
We generate a state sequence $\{i_0, i_1, \ldots\}$ according to the exploration-enhanced transition matrix $\bar P$ (or in fact with any steady-state distribution $\bar\xi$, such as the uniform distribution). We also generate an additional sequence of independent transitions $\{(i_0, j_0), (i_1, j_1), \ldots\}$ according to the original transition matrix $P$.
We approximate the matrix $C$ and vector $d$ of Eq. (6.119) using the formulas
\[
C_k = \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\big( \phi(i_t) - \alpha\phi(j_t) \big)',
\]
and
\[
d_k = \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\,g(i_t, j_t),
\]
in place of Eqs. (6.48) and (6.49). Similar to the earlier case in Section 6.3.3, where $\bar P = P$, it can be shown using law of large numbers arguments that $C_k \to C$ and $d_k \to d$ with probability 1.
The corresponding approximation $C_k r = d_k$ to the projected equation $\Phi r = \bar\Pi T(\Phi r)$ can be written as
\[
\sum_{t=0}^k \phi(i_t)\,\tilde q_{k,t} = 0, \tag{6.120}
\]
where
\[
\tilde q_{k,t} = \phi(i_t)'r_k - \alpha\phi(j_t)'r_k - g(i_t, j_t)
\]
is a temporal difference associated with the transition $(i_t, j_t)$ [cf. Eq. (6.54)]. The three terms in the definition of $\tilde q_{k,t}$ can be viewed as samples [associated with the transition $(i_t, j_t)$] of the corresponding three terms of the expression $\bar\Xi(\Phi r_k - \alpha P\Phi r_k - g)$ in Eq. (6.118).
In a modified form of LSTD(0), we approximate the solution $C^{-1}d$ of the projected equation with $C_k^{-1}d_k$. In a modified form of (scaled) LSPE(0), we approximate the term $(Cr_k - d)$ in PVI by $(C_kr_k - d_k)$, leading to the iteration
\[
r_{k+1} = r_k - \frac{\gamma}{k+1}\,G_k\sum_{t=0}^k \phi(i_t)\,\tilde q_{k,t}, \tag{6.121}
\]
where $\gamma$ is small enough to guarantee convergence [cf. Eq. (6.71)]. Finally, the modified form of TD(0) is
\[
r_{k+1} = r_k - \gamma_k\,\phi(i_k)\,\tilde q_{k,k}, \tag{6.122}
\]
where $\gamma_k$ is a positive diminishing stepsize [cf. Eq. (6.77)]. Unfortunately, versions of these schemes for $\lambda > 0$ are complicated because of the difficulty of generating extra transitions in a multistep context.
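The following Python sketch illustrates the extra-transitions scheme just described for $\lambda = 0$: the state sequence follows the exploration-enhanced matrix $\bar P$, while each sample transition $(i_t, j_t)$ is generated with the original matrix $P$. The chain, the mixture used for $\bar P$, the costs, features, and $\alpha$ are illustrative assumptions.

    import numpy as np

    # Exploration with extra transitions: accumulate C_k and d_k from transitions
    # (i_t, j_t) under P, while the visited states i_t follow P_bar, cf. Eq. (6.115).
    rng = np.random.default_rng(5)
    n, s, alpha = 3, 2, 0.9
    P = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
    Q = np.full((n, n), 1.0 / n)
    beta = 0.2                                     # exploration probability at every state
    P_bar = (1 - beta) * P + beta * Q
    g = np.array([[1.0, 2.0, 0.0], [0.5, 1.0, 1.5], [2.0, 0.0, 1.0]])
    Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

    C = np.zeros((s, s)); d = np.zeros(s); i = 0
    for t in range(100000):
        j_extra = rng.choice(n, p=P[i])            # extra transition (i_t, j_t) under P
        C += np.outer(Phi[i], Phi[i] - alpha * Phi[j_extra])
        d += Phi[i] * g[i, j_extra]
        i = rng.choice(n, p=P_bar[i])              # next state of the sequence under P_bar
    print("exploration-enhanced LSTD(0) estimate r =", np.linalg.solve(C, d))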
Exploration Using Modified Temporal Differences
We will now present an alternative exploration approach that works for all $\lambda \ge 0$. Like the preceding approach, it aims to solve the projected equation $\Phi r = \bar\Pi T^{(\lambda)}(\Phi r)$ [cf. Eq. (6.117)], but it does not require extra transitions. It does require, however, the explicit knowledge of the transition probabilities $p_{ij}$ and $\bar p_{ij}$, so it does not apply to the model-free context. (Later, in Section 6.5, we will discuss appropriate model-free modifications in the context of Q-learning.)
Here we generate a single state sequence $\{i_0, i_1, \ldots\}$ according to the exploration-enhanced transition matrix $\bar P$. The formulas of the various TD algorithms are similar to the ones given earlier, but we use modified versions of temporal differences, defined by
\[
\tilde q_{k,t} = \phi(i_t)'r_k - \frac{p_{i_ti_{t+1}}}{\bar p_{i_ti_{t+1}}}\big( \alpha\phi(i_{t+1})'r_k + g(i_t, i_{t+1}) \big), \tag{6.123}
\]
where $p_{ij}$ and $\bar p_{ij}$ denote the $ij$th components of $P$ and $\bar P$, respectively.
Consider now the case where $\lambda = 0$ and the approximation of the matrix $C$ and vector $d$ of Eq. (6.119) by simulation: we generate a state sequence $i_0, i_1, \ldots$ using the exploration-enhanced transition matrix $\bar P$. After collecting $k + 1$ samples ($k = 0, 1, \ldots$), we form
\[
C_k = \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\Bigg( \phi(i_t) - \alpha\,\frac{p_{i_ti_{t+1}}}{\bar p_{i_ti_{t+1}}}\,\phi(i_{t+1}) \Bigg)',
\]
and
\[
d_k = \frac{1}{k+1}\sum_{t=0}^k \frac{p_{i_ti_{t+1}}}{\bar p_{i_ti_{t+1}}}\,\phi(i_t)\,g(i_t, i_{t+1}).
\]
Similar to the earlier case in Section 6.3.3, where $\bar P = P$, it can be shown using simple law of large numbers arguments that $C_k \to C$ and $d_k \to d$ with probability 1 (see also Section 6.8.1, where this approximation approach is discussed within a more general context). Note that the approximation $C_kr = d_k$ to the projected equation can also be written as
\[
\sum_{t=0}^k \phi(i_t)\,\tilde q_{k,t} = 0,
\]
Note the difference in the sampling of transitions. Whereas in the preceding scheme with extra transitions, $(i_t, j_t)$ was generated according to the original transition matrix $P$, here $(i_t, i_{t+1})$ is generated according to the exploration-enhanced transition matrix $\bar P$. The approximation of an expected value with respect to a given distribution (induced by the transition matrix $P$) by sampling with respect to a different distribution (induced by the exploration-enhanced transition matrix $\bar P$) is reminiscent of importance sampling (cf. Section 6.1.5). The probability ratio $p_{i_ti_{t+1}}/\bar p_{i_ti_{t+1}}$ in Eq. (6.123) provides the necessary correction.
where $\tilde q_{k,t}$ is the modified temporal difference given by Eq. (6.123) [cf. Eq. (6.120)].
The exploration-enhanced LSTD(0) method is simply $r_k = C_k^{-1}d_k$, and converges with probability 1 to the solution of the projected equation $\Phi r = \bar\Pi T(\Phi r)$. Exploration-enhanced versions of LSPE(0) and TD(0) can be similarly derived [cf. Eqs. (6.121) and (6.122)], but for convergence of these methods, the mapping $\bar\Pi T$ should be a contraction, which is guaranteed only if $\bar P$ differs from $P$ by a small amount (see the subsequent Prop. 6.3.6).
Let us now consider the case where $\lambda > 0$. We first note that increasing values of $\lambda$ tend to preserve the contraction of $\bar\Pi T^{(\lambda)}$. In fact, given any norm $\|\cdot\|_{\bar\xi}$, $T^{(\lambda)}$ is a contraction with respect to that norm, provided $\lambda$ is sufficiently close to 1 (see Prop. 6.3.5, which shows that the contraction modulus of $T^{(\lambda)}$ tends to 0 as $\lambda \to 1$). This implies that given any exploration probabilities from the range $[0, 1]$ such that $\bar P$ is irreducible, there exists $\bar\lambda \in [0, 1)$ such that $T^{(\lambda)}$ and $\bar\Pi T^{(\lambda)}$ are contractions with respect to $\|\cdot\|_{\bar\xi}$ for all $\lambda \in [\bar\lambda, 1)$.
Exploration-enhanced versions of LSTD(λ) and LSPE(λ) have been obtained by Bertsekas and Yu [BeY09], to which we refer for their detailed development. In particular, the exploration-enhanced LSTD(λ) method computes $r_k$ as the solution of the equation $C_k^{(\lambda)}r = d_k^{(\lambda)}$, with $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$ generated with recursions similar to the ones with unmodified TD [cf. Eqs. (6.93)-(6.95)]:
\[
C_k^{(\lambda)} = (1-\delta_k)C_{k-1}^{(\lambda)} + \delta_k z_k\Bigg( \phi(i_k) - \alpha\,\frac{p_{i_ki_{k+1}}}{\bar p_{i_ki_{k+1}}}\,\phi(i_{k+1}) \Bigg)', \tag{6.124}
\]
\[
d_k^{(\lambda)} = (1-\delta_k)d_{k-1}^{(\lambda)} + \delta_k z_k\,\frac{p_{i_ki_{k+1}}}{\bar p_{i_ki_{k+1}}}\,g(i_k, i_{k+1}), \tag{6.125}
\]
where $z_k$ are modified eligibility vectors given by
\[
z_k = \alpha\lambda\,\frac{p_{i_{k-1}i_k}}{\bar p_{i_{k-1}i_k}}\,z_{k-1} + \phi(i_k), \tag{6.126}
\]
the initial conditions are $z_{-1} = 0$, $C_{-1} = 0$, $d_{-1} = 0$, and
\[
\delta_k = \frac{1}{k+1}, \qquad k = 0, 1, \ldots.
\]
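The following Python sketch shows the recursions (6.124)-(6.126) in action: a single sequence is generated with the exploration-enhanced matrix $\bar P$, and the ratios $p_{ij}/\bar p_{ij}$ correct the eligibility vector and the estimates. The chain, the mixture defining $\bar P$, the costs, features, $\alpha$, and $\lambda$ are illustrative assumptions chosen so that $\alpha\lambda\max_{(i,j)}(p_{ij}/\bar p_{ij}) < 1$.

    import numpy as np

    # Exploration-enhanced LSTD(lambda) via the recursions (6.124)-(6.126).
    rng = np.random.default_rng(6)
    n, s, alpha, lam = 3, 2, 0.9, 0.3
    P = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
    P_bar = 0.8 * P + 0.2 * np.full((n, n), 1.0 / n)
    g = np.array([[1.0, 2.0, 0.0], [0.5, 1.0, 1.5], [2.0, 0.0, 1.0]])
    Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])

    i = int(rng.integers(n))
    z = Phi[i].copy()                              # z_0 = phi(i_0)
    C = np.zeros((s, s)); d = np.zeros(s)
    for k in range(200000):
        j = rng.choice(n, p=P_bar[i])              # (i_k, i_{k+1}) generated under P_bar
        ratio = P[i, j] / P_bar[i, j]              # importance correction p_ij / p_bar_ij
        delta = 1.0 / (k + 1)
        C = (1 - delta) * C + delta * np.outer(z, Phi[i] - alpha * ratio * Phi[j])  # (6.124)
        d = (1 - delta) * d + delta * z * (ratio * g[i, j])                          # (6.125)
        z = alpha * lam * ratio * z + Phi[j]       # Eq. (6.126) for the next step
        i = j
    print("exploration-enhanced LSTD(lambda) estimate r =", np.linalg.solve(C, d))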
It is possible to show the convergence of $r_k$ to the solution of the exploration-enhanced projected equation $\Phi r = \bar\Pi T^{(\lambda)}(\Phi r)$, assuming only that this equation has a unique solution (a contraction property is not necessary since LSTD is not an iterative method, but rather approximates the projected equation by simulation).
The exploration-enhanced LSPE(λ) iteration is given by
\[
r_{k+1} = r_k - \gamma G_k\big( C_k^{(\lambda)}r_k - d_k^{(\lambda)} \big), \tag{6.127}
\]
where $G_k$ is a scaling matrix, $\gamma$ is a positive stepsize, and $\tilde q_{k,t}$ are the modified temporal differences (6.123). Convergence of iteration (6.127) requires that $G_k$ converges to a matrix $G$ such that $I - \gamma GC^{(\lambda)}$ is a contraction. A favorable special case is the iterative regression method [cf. Eqs. (6.89) and (6.90)]
\[
r_{k+1} = \Big( C_k^{(\lambda)\prime}\Sigma_k^{-1}C_k^{(\lambda)} + \beta I \Big)^{-1}\Big( C_k^{(\lambda)\prime}\Sigma_k^{-1}d_k^{(\lambda)} + \beta r_k \Big). \tag{6.128}
\]
This method converges for any $\lambda$ as it does not require that $\bar\Pi T^{(\lambda)}$ be a contraction with respect to $\|\cdot\|_{\bar\xi}$. The corresponding LSPE(λ) method
\[
r_{k+1} = r_k - \gamma\Bigg( \frac{1}{k+1}\sum_{t=0}^k \phi(i_t)\phi(i_t)' \Bigg)^{-1}\big( C_k^{(\lambda)}r_k - d_k^{(\lambda)} \big)
\]
is convergent only if $\bar\Pi T^{(\lambda)}$ is a contraction.
We finally note that an exploration-enhanced version of TD(λ) has been developed in [BeY09] (Section 5.3). It has the form
\[
r_{k+1} = r_k - \gamma_k\,z_k\,\tilde q_{k,k},
\]
where $\gamma_k$ is a stepsize parameter and $\tilde q_{k,k}$ is the modified temporal difference
\[
\tilde q_{k,k} = \phi(i_k)'r_k - \frac{p_{i_ki_{k+1}}}{\bar p_{i_ki_{k+1}}}\big( \alpha\phi(i_{k+1})'r_k + g(i_k, i_{k+1}) \big)
\]
[cf. Eqs. (6.98) and (6.123)]. However, this method is guaranteed to converge to the solution of the exploration-enhanced projected equation $\Phi r = \bar\Pi T^{(\lambda)}(\Phi r)$ only if $\bar\Pi T^{(\lambda)}$ is a contraction. We next discuss conditions under which this is so.
The analysis of the convergence $C_k^{(\lambda)} \to C^{(\lambda)}$ and $d_k^{(\lambda)} \to d^{(\lambda)}$ has been given in several sources under different assumptions: (a) In Nedic and Bertsekas [NeB03] for the case of policy evaluation in an $\alpha$-discounted problem with no exploration ($\bar P = P$). (b) In Bertsekas and Yu [BeY09], assuming that $\alpha\lambda\max_{(i,j)}(p_{ij}/\bar p_{ij}) < 1$ (where we adopt the convention $0/0 = 0$), in which case the eligibility vectors $z_k$ of Eq. (6.126) are generated by a contractive process and remain bounded. This covers the common case $\bar P = (1-\epsilon)P + \epsilon Q$, where $\epsilon > 0$ is a constant. The essential restriction here is that $\alpha\lambda$ should be no more than $1-\epsilon$. (c) In Yu [Yu10a,b], for all $\lambda \in [0, 1]$ and no other restrictions, in which case the eligibility vectors $z_k$ typically become unbounded as $k$ increases when $\alpha\lambda\max_{(i,j)}(p_{ij}/\bar p_{ij}) > 1$. Mathematically, this is the most challenging analysis, and involves interesting stochastic phenomena.
Contraction Properties of Exploration-Enhanced Methods
We now consider the question whether $\bar\Pi T^{(\lambda)}$ is a contraction. This is important for the corresponding LSPE(λ)- and TD(λ)-type methods, which are valid only if $\bar\Pi T^{(\lambda)}$ is a contraction, as mentioned earlier. Generally, $\bar\Pi T^{(\lambda)}$ may not be a contraction. The key difficulty here is a potential norm mismatch: even if $T^{(\lambda)}$ is a contraction with respect to some norm, $\bar\Pi$ may not be nonexpansive with respect to the same norm.
We recall the definition
\[
\bar P = (I - B)P + BQ
\]
[cf. Eq. (6.115)], where $B$ is a diagonal matrix with the exploration probabilities $\beta_i \in [0, 1]$ on the diagonal, and $Q$ is another transition probability matrix. The following proposition quantifies the restrictions on the size of the exploration probabilities in order to avoid the difficulty just described. Since $\bar\Pi$ is nonexpansive with respect to $\|\cdot\|_{\bar\xi}$, the proof is based on finding values of $\beta_i$ for which $T^{(\lambda)}$ is a contraction with respect to $\|\cdot\|_{\bar\xi}$. This is equivalent to showing that the corresponding induced norm of the matrix
\[
\alpha P^{(\lambda)} = (1-\lambda)\sum_{t=0}^\infty \lambda^t(\alpha P)^{t+1} \tag{6.129}
\]
[cf. Eq. (6.84)] is less than 1.
Proposition 6.3.6: Assume that $\bar P$ is irreducible and $\bar\xi$ is its invariant distribution. Then $T^{(\lambda)}$ and $\bar\Pi T^{(\lambda)}$ are contractions with respect to $\|\cdot\|_{\bar\xi}$ for all $\lambda \in [0, 1)$ provided $\bar\alpha < 1$, where
\[
\bar\alpha = \frac{\alpha}{\sqrt{1 - \max_{i=1,\ldots,n}\beta_i}}.
\]
The associated modulus of contraction is at most equal to
\[
\bar\alpha_\lambda = \frac{\bar\alpha(1-\lambda)}{1-\bar\alpha\lambda}.
\]
Proof: For all $z \in \Re^n$ with $z \ne 0$, we have
\[
\|\alpha Pz\|_{\bar\xi}^2 = \alpha^2\sum_{i=1}^n \bar\xi_i\Bigg( \sum_{j=1}^n p_{ij}z_j \Bigg)^2 \le \alpha^2\sum_{i=1}^n \bar\xi_i\sum_{j=1}^n p_{ij}z_j^2 \le \alpha^2\sum_{i=1}^n \bar\xi_i\sum_{j=1}^n \frac{\bar p_{ij}}{1-\beta_i}\,z_j^2 \le \frac{\alpha^2}{1-\max_i\beta_i}\sum_{j=1}^n\sum_{i=1}^n \bar\xi_i\bar p_{ij}z_j^2 = \frac{\alpha^2}{1-\max_i\beta_i}\sum_{j=1}^n \bar\xi_jz_j^2 = \bar\alpha^2\|z\|_{\bar\xi}^2,
\]
where the first inequality follows from the convexity of the quadratic function, the second inequality follows from the fact $(1-\beta_i)p_{ij} \le \bar p_{ij}$, and the next to last equality follows from the property
\[
\sum_{i=1}^n \bar\xi_i\bar p_{ij} = \bar\xi_j
\]
of the invariant distribution $\bar\xi$. Thus, $\alpha P$ is a contraction with respect to $\|\cdot\|_{\bar\xi}$ with modulus at most $\bar\alpha$.
Next we note that if $\bar\alpha < 1$, the norm of the matrix $\alpha P^{(\lambda)}$ of Eq. (6.129) is bounded by
\[
(1-\lambda)\sum_{t=0}^\infty \lambda^t\|\alpha P\|_{\bar\xi}^{t+1} \le (1-\lambda)\sum_{t=0}^\infty \lambda^t\bar\alpha^{t+1} = \frac{\bar\alpha(1-\lambda)}{1-\bar\alpha\lambda} < 1, \tag{6.130}
\]
from which the result follows. Q.E.D.
The preceding proposition delineates a range of values for the exploration probabilities in order for $\bar\Pi T^{(\lambda)}$ to be a contraction: it is sufficient that
\[
\beta_i < 1 - \alpha^2, \qquad i = 1, \ldots, n,
\]
independent of the value of $\lambda$. We next consider the effect of $\lambda$ on the range of allowable exploration probabilities. While it seems difficult to fully quantify this effect, it appears that values of $\lambda$ close to 1 tend to enlarge the range. In fact, $\bar\Pi T^{(\lambda)}$ is a contraction with respect to any norm $\|\cdot\|_{\bar\xi}$, and consequently for any value of the exploration probabilities $\beta_i$, provided $\lambda$ is sufficiently close to 1. This is shown in the following proposition.
Proposition 6.3.7: Given any exploration probabilities from the range $[0, 1]$ such that $\bar P$ is irreducible, there exists $\bar\lambda \in [0, 1)$ such that $T^{(\lambda)}$ and $\bar\Pi T^{(\lambda)}$ are contractions with respect to $\|\cdot\|_{\bar\xi}$ for all $\lambda \in [\bar\lambda, 1)$.

Proof: By Prop. 6.3.6, there exists a weighted Euclidean norm $\|\cdot\|_{\hat\xi}$ such that $\|\alpha P\|_{\hat\xi} < 1$. From Eq. (6.130) it follows that $\lim_{\lambda\to1}\|\alpha P^{(\lambda)}\|_{\hat\xi} = 0$ and hence $\lim_{\lambda\to1}\alpha P^{(\lambda)} = 0$. It follows that given any norm $\|\cdot\|$, $\alpha P^{(\lambda)}$ is a contraction with respect to that norm for $\lambda$ sufficiently close to 1. In particular, this is true for any norm $\|\cdot\|_{\bar\xi}$, where $\bar\xi$ is the invariant distribution of an irreducible $\bar P$ that is generated with any exploration probabilities from the range $[0, 1]$. Q.E.D.
We finally note that assuming $\bar\Pi T^{(\lambda)}$ is a contraction with modulus
\[
\bar\alpha_\lambda = \frac{\bar\alpha(1-\lambda)}{1-\bar\alpha\lambda},
\]
as per Prop. 6.3.6, we have the error bound
\[
\|J_\mu - \Phi\bar r_\lambda^*\|_{\bar\xi} \le \frac{1}{\sqrt{1-\bar\alpha_\lambda^2}}\,\|J_\mu - \bar\Pi J_\mu\|_{\bar\xi},
\]
where $\Phi\bar r_\lambda^*$ is the fixed point of $\bar\Pi T^{(\lambda)}$. The proof is nearly identical to the one of Prop. 6.3.5.
6.3.8 Policy Oscillations - Chattering
We will now describe a generic mechanism that tends to cause policy oscillations in approximate policy iteration. To this end, we introduce the so called greedy partition. For a given approximation architecture $\tilde J(\cdot, r)$, this is a partition of the space $\Re^s$ of parameter vectors $r$ into subsets $R_\mu$, each subset corresponding to a stationary policy $\mu$, and defined by
\[
R_\mu = \big\{ r \mid T_\mu(\Phi r) = T(\Phi r) \big\}
\]
or equivalently
\[
R_\mu = \Bigg\{ r\ \Bigg|\ \mu(i) \in \arg\min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\big( g(i, u, j) + \alpha\tilde J(j, r) \big),\ i = 1, \ldots, n \Bigg\}.
\]
Thus, $R_\mu$ is the set of parameter vectors $r$ for which $\mu$ is greedy with respect to $\tilde J(\cdot, r)$.
We first consider the nonoptimistic version of approximate policy iteration. For simplicity, let us assume that we use a policy evaluation method (e.g., a projected equation or other method) that for each given $\mu$ produces a unique parameter vector denoted $r_\mu$. Nonoptimistic policy iteration starts with a parameter vector $r_0$, which specifies $\mu^0$ as a greedy policy with respect to $\tilde J(\cdot, r_0)$, and generates $r_{\mu^0}$ by using the given policy evaluation method. It then finds a policy $\mu^1$ that is greedy with respect to $\tilde J(\cdot, r_{\mu^0})$, i.e., a $\mu^1$ such that
\[
r_{\mu^0} \in R_{\mu^1}.
\]
It then repeats the process with $\mu^1$ replacing $\mu^0$. If some policy $\mu^k$ satisfying
\[
r_{\mu^k} \in R_{\mu^k} \tag{6.131}
\]
is encountered, the method keeps generating that policy. This is the necessary and sufficient condition for policy convergence in the nonoptimistic policy iteration method.
Figure 6.3.4 Greedy partition and cycle of policies generated by nonoptimistic policy iteration with cost function approximation. In particular, $\mu$ yields $\bar\mu$ by policy improvement if and only if $r_\mu \in R_{\bar\mu}$. In this figure, the method cycles between four policies and the corresponding four parameters $r_{\mu^k}$, $r_{\mu^{k+1}}$, $r_{\mu^{k+2}}$, and $r_{\mu^{k+3}}$.
In the case of a lookup table representation where the parameter vectors $r_\mu$ are equal to the cost-to-go vector $J_\mu$, the condition $r_{\mu^k} \in R_{\mu^k}$ is equivalent to $r_{\mu^k} = Tr_{\mu^k}$, and is satisfied if and only if $\mu^k$ is optimal. When there is cost function approximation, however, this condition need not be satisfied for any policy. Since there is a finite number of possible vectors $r_\mu$, one generated from another in a deterministic way, the algorithm ends up repeating some cycle of policies $\mu^k, \mu^{k+1}, \ldots, \mu^{k+m}$ with
\[
r_{\mu^k} \in R_{\mu^{k+1}},\ \ r_{\mu^{k+1}} \in R_{\mu^{k+2}},\ \ \ldots,\ \ r_{\mu^{k+m-1}} \in R_{\mu^{k+m}},\ \ r_{\mu^{k+m}} \in R_{\mu^k} \tag{6.132}
\]
(see Fig. 6.3.4). Furthermore, there may be several different cycles, and the method may end up converging to any one of them. The actual cycle obtained depends on the initial policy $\mu^0$. This is similar to gradient methods applied to minimization of functions with multiple local minima, where the limit of convergence depends on the starting point.
We now turn to examine policy oscillations in optimistic variants of policy evaluation methods with function approximation. Then the trajectory of the method is less predictable and depends on the fine details of the iterative policy evaluation method, such as the frequency of the policy updates and the stepsize used. Generally, given the current policy $\mu$, optimistic policy iteration will move towards the corresponding "target" parameter $r_\mu$, for as long as $\mu$ continues to be greedy with respect to the current cost-to-go approximation $\tilde J(\cdot, r)$, that is, for as long as the current parameter vector $r$ belongs to the set $R_\mu$. Once, however, the parameter $r$ crosses into another set, say $R_{\bar\mu}$, the policy $\bar\mu$ becomes greedy, and $r$ changes course and starts moving towards the new "target" $r_{\bar\mu}$. Thus, the "targets" $r_\mu$ of the method, and the corresponding policies $\mu$ and sets $R_\mu$ may keep changing, similar to nonoptimistic policy iteration. Simultaneously, the parameter vector $r$ will move near the boundaries that separate the regions $R_\mu$ that the method visits, following reduced versions of the cycles that nonoptimistic policy iteration may follow (see Fig. 6.3.5). Furthermore, as Fig. 6.3.5 shows, if diminishing parameter changes are made between policy updates (such as for example when a diminishing stepsize is used by the policy evaluation method) and the method eventually cycles between several policies, the parameter vectors will tend to converge to the common boundary of the regions $R_\mu$ corresponding to these policies. This is the so-called chattering phenomenon for optimistic policy iteration, whereby there is simultaneously oscillation in policy space and convergence in parameter space.
An additional insight is that the choice of the iterative policy evaluation method (e.g., LSTD, LSPE, or TD for various values of $\lambda$) makes a difference in rate of convergence, but does not seem crucial for the quality of the final policy obtained (as long as the methods converge). Using a different value of $\lambda$ changes the targets $r_\mu$ somewhat, but leaves the greedy partition unchanged. As a result, different methods "fish in the same waters" and tend to yield similar ultimate cycles of policies.
The following is an example of policy oscillations and chattering. Other examples are given in Section 6.4.2 of [BeT96] (Examples 6.9 and 6.10).
Example 6.3.2 (Policy Oscillation and Chattering)

Consider a discounted problem with two states, 1 and 2, illustrated in Fig. 6.3.6(a). There is a choice of control only at state 1, and there are two policies, denoted μ* and μ. The optimal policy μ*, when at state 1, stays at 1 with probability p > 0 and incurs a negative cost c. The other policy is μ and cycles between the two states with 0 cost. We consider linear approximation with a single feature φ(i) = i for each of the states i = 1, 2, i.e.,

$$\Phi = \begin{pmatrix} 1 \\ 2 \end{pmatrix},\qquad \tilde J = \Phi r = \begin{pmatrix} r \\ 2r \end{pmatrix}.$$
Figure 6.3.5 Illustration of a trajectory of optimistic policy iteration with cost function approximation. The algorithm settles into an oscillation between policies μ^1, μ^2, μ^3 with r_{μ^1} ∈ R_{μ^2}, r_{μ^2} ∈ R_{μ^3}, r_{μ^3} ∈ R_{μ^1}. The parameter vectors converge to the common boundary of these policies.
Figure 6.3.6 The problem of Example 6.3.2. (a) Costs and transition probabilities for the policies μ and μ*. (b) The greedy partition and the solutions r̄_μ and r̄_{μ*} of the projected equations corresponding to μ and μ*. Nonoptimistic policy iteration oscillates between r̄_μ and r̄_{μ*}.
Let us construct the greedy partition. With J̃ = Φr = (r, 2r)′, the policy μ* is greedy at state 1 if and only if c + αr(2 − p) ≤ 2αr, i.e., r ≥ c/(αp), so that

$$R_{\mu^*} = \bigl\{ r \mid r \ge c/(\alpha p) \bigr\},\qquad R_{\mu} = \bigl\{ r \mid r \le c/(\alpha p) \bigr\}.$$

We next calculate the points r̄_μ and r̄_{μ*} that solve the projected equations

$$C_\mu \bar r_\mu = d_\mu,\qquad C_{\mu^*} \bar r_{\mu^*} = d_{\mu^*},$$

which correspond to μ and μ*, respectively [cf. Eqs. (6.40), (6.41)]. We have

$$C_\mu = \Phi'\Xi_\mu(I - \alpha P_\mu)\Phi = (\,1\ \ 2\,)\begin{pmatrix} \tfrac12 & 0 \\ 0 & \tfrac12 \end{pmatrix}\begin{pmatrix} 1 & -\alpha \\ -\alpha & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \frac{5 - 4\alpha}{2},$$

$$d_\mu = \Phi'\Xi_\mu g_\mu = (\,1\ \ 2\,)\begin{pmatrix} \tfrac12 & 0 \\ 0 & \tfrac12 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \end{pmatrix} = 0,$$

so

$$\bar r_\mu = 0.$$

Similarly, with some calculation,

$$C_{\mu^*} = \Phi'\Xi_{\mu^*}(I - \alpha P_{\mu^*})\Phi = (\,1\ \ 2\,)\begin{pmatrix} \tfrac{1}{2-p} & 0 \\[2pt] 0 & \tfrac{1-p}{2-p} \end{pmatrix}\begin{pmatrix} 1 - \alpha p & -\alpha(1-p) \\ -\alpha & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \frac{5 - 4p - \alpha(4 - 3p)}{2-p},$$

$$d_{\mu^*} = \Phi'\Xi_{\mu^*} g_{\mu^*} = (\,1\ \ 2\,)\begin{pmatrix} \tfrac{1}{2-p} & 0 \\[2pt] 0 & \tfrac{1-p}{2-p} \end{pmatrix}\begin{pmatrix} c \\ 0 \end{pmatrix} = \frac{c}{2-p},$$

so

$$\bar r_{\mu^*} = \frac{c}{5 - 4p - \alpha(4 - 3p)}.$$

We now note that since c < 0,

$$\bar r_\mu = 0 \in R_{\mu^*},$$

while for p ≈ 1 and α close to 1, we have

$$\bar r_{\mu^*} \approx \frac{c}{1-\alpha} \in R_\mu;$$

cf. Fig. 6.3.6(b). In this case, approximate policy iteration cycles between μ and μ*. Optimistic policy iteration uses some algorithm that moves the current value r towards r̄_{μ*} if r ∈ R_{μ*}, and towards r̄_μ if r ∈ R_μ. Thus optimistic policy iteration starting from a point in R_{μ*} moves towards r̄_{μ*} and once it crosses the boundary point c/(αp) of the greedy partition, it reverses course and moves towards r̄_μ. If the method makes small incremental changes in r before checking whether to change the current policy, it will incur a small oscillation around c/(αp). If the incremental changes in r are diminishing, the method will converge to c/(αp). Yet c/(αp) does not correspond to any one of the two policies and has no meaning as a desirable parameter value.

Notice that it is hard to predict when an oscillation will occur and what kind of oscillation it will be. For example if c > 0, we have

$$\bar r_\mu = 0 \in R_\mu,$$

while for p ≈ 1 and α close to 1, we have

$$\bar r_{\mu^*} \approx \frac{c}{1-\alpha} \in R_{\mu^*}.$$

In this case approximate as well as optimistic policy iteration will converge to μ (or μ*) if started with r in R_μ (or R_{μ*}, respectively).
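The numbers above are easy to check directly. The following small Python sketch (with arbitrary illustrative values for α, p, and c, which are not taken from the text) solves the two projected equations and prints the greedy-partition boundary, confirming that for c < 0 each solution lands in the other policy's greedy region:

    import numpy as np

    # Two-state problem of Example 6.3.2, with feature matrix Phi = (1, 2)'.
    alpha, p, c = 0.9, 0.9, -1.0      # illustrative values only
    Phi = np.array([[1.0], [2.0]])

    def projected_solution(P, g, xi):
        """Solve C r = d with C = Phi' Xi (I - alpha P) Phi, d = Phi' Xi g."""
        Xi = np.diag(xi)
        C = Phi.T @ Xi @ (np.eye(2) - alpha * P) @ Phi
        d = Phi.T @ Xi @ g
        return np.linalg.solve(C, d)[0]

    # Policy mu: cycles 1 <-> 2 with zero cost; steady state (1/2, 1/2).
    P_mu = np.array([[0.0, 1.0], [1.0, 0.0]])
    r_mu = projected_solution(P_mu, np.zeros(2), [0.5, 0.5])

    # Policy mu*: stays at 1 w.p. p (cost c), else moves to 2; state 2 returns to 1.
    P_star = np.array([[p, 1.0 - p], [1.0, 0.0]])
    xi_star = [1.0 / (2.0 - p), (1.0 - p) / (2.0 - p)]
    r_star = projected_solution(P_star, np.array([c, 0.0]), xi_star)

    boundary = c / (alpha * p)   # point where both controls at state 1 are equally greedy
    print(f"r_mu = {r_mu:.4f}, r_mu* = {r_star:.4f}, boundary = {boundary:.4f}")
    # For c < 0: r_mu = 0 lies on the mu* side of the boundary, while r_mu* lies
    # on the mu side, so nonoptimistic approximate policy iteration oscillates.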
When chattering occurs, the limit of optimistic policy iteration tends
to be on a common boundary of several subsets of the greedy partition
and may not meaningfully represent a cost approximation of any of the
corresponding policies, as illustrated by the preceding example. Thus, the
limit to which the method converges cannot always be used to construct
an approximation of the cost-to-go of any policy or the optimal cost-to-
go. As a result, at the end of optimistic policy iteration and in contrast
with the nonoptimistic version, one must go back and perform a screening
process; that is, evaluate by simulation the many policies generated by the
method starting from the initial conditions of interest and select the most
promising one. This is a disadvantage of optimistic policy iteration that
may nullify whatever practical rate of convergence advantages it may have
over its nonoptimistic counterpart.
We note, however, that computational experience indicates that for many problems, the cost functions of the different policies involved in chattering may not be too different. Indeed, suppose that we have convergence to a parameter vector r and that there is a steady-state policy oscillation involving a collection of policies M. Then, all the policies in M are greedy with respect to J̃(·, r), which implies that there is a subset of states i such that there are at least two different controls μ^1(i) and μ^2(i) satisfying

$$\min_{u\in U(i)} \sum_j p_{ij}(u)\bigl(g(i,u,j) + \alpha \tilde J(j,r)\bigr)
= \sum_j p_{ij}\bigl(\mu^1(i)\bigr)\bigl(g(i,\mu^1(i),j) + \alpha \tilde J(j,r)\bigr)
= \sum_j p_{ij}\bigl(\mu^2(i)\bigr)\bigl(g(i,\mu^2(i),j) + \alpha \tilde J(j,r)\bigr).\tag{6.133}$$

Each equation of this type can be viewed as a constraining relation on the parameter vector r. Thus, excluding singular situations, there will be at most s relations of the form (6.133) holding, where s is the dimension of r. This implies that there will be at most s "ambiguous" states where more than one control is greedy with respect to J̃(·, r) (in Example 6.3.2, state 1 is ambiguous).

Now assume that we have a problem where the total number of states is much larger than s, and in addition there are no "critical" states; that is, the cost consequences of changing a policy in just a small number of states (say, of the order of s) is relatively small. It then follows that all policies in the set M involved in chattering have roughly the same cost. Furthermore, for the methods of this section, one may argue that the cost approximation J̃(·, r) is close to the cost approximation J̃(·, r_μ) that would be generated for any of the policies μ ∈ M. Note, however, that the assumption of no critical states, aside from the fact that it may not be easily quantifiable, will not be true for many problems.
While the preceding argument may explain some of the observed em-
pirical behavior, an important concern remains: even if all policies involved
in chattering have roughly the same cost, it is still possible that none of
them is particularly good; the policy iteration process may be just cycling
in a bad part of the greedy partition. An interesting case in point is
the game of tetris, which has been used as a testbed for approximate DP
methods [Van95], [TsV96], [BeI96], [Kak02], [FaV06], [SzL06], [DFM09].
Using a set of 22 features and approximate policy iteration with policy
evaluation based on the projected equation and the LSPE method [BeI96],
an average score of a few thousands was achieved. Using the same fea-
tures and a random search method in the space of weight vectors r, an
average score of over 900,000 was achieved [ThS09]. This suggests that
in the tetris problem, policy iteration using the projected equation is seri-
ously hampered by oscillations/chattering between relatively poor policies,
roughly similar to the attraction of gradient methods to poor local minima.
The full ramifications of policy oscillation in practice are not fully understood at present, but it is clear that they give serious reason for concern. Moreover local minima-type phenomena may be causing similar difficulties in other related approximate DP methodologies: approximate policy iteration with the Bellman error method (see Section 6.8.4), policy gradient methods (see Section 6.9), and approximate linear programming (the tetris problem, using the same 22 features, has been addressed by approximate linear programming [FaV06], [DFM09], and with a policy gradient method [Kak02], also with an achieved average score of a few thousands).
Conditions for Policy Convergence

The preceding discussion has illustrated the detrimental effects of policy oscillation in approximate policy iteration. Another reason why convergence of policies may be desirable has to do with error bounds. Generally, in approximate policy iteration, by Prop. 1.3.6, we have an error bound of the form

$$\limsup_{k\to\infty} \|J_{\mu^k} - J^*\|_\infty \le \frac{2\alpha\delta}{(1-\alpha)^2},$$

where δ satisfies

$$\|J_k - J_{\mu^k}\|_\infty \le \delta$$

for all generated policies μ^k and J_k is the approximate cost vector of μ^k that is used for policy improvement. However, when the policy sequence {μ^k} terminates with some μ̄ and the policy evaluation is accurate to within δ (in the sup-norm sense ‖Φr_μ − J_μ‖_∞ ≤ δ, for all μ), then one can show the much sharper bound

$$\|J_{\bar\mu} - J^*\|_\infty \le \frac{2\alpha\delta}{1-\alpha}.\tag{6.134}$$
For a proof, let J̄ be the cost vector Φr_μ̄ obtained by policy evaluation of μ̄, and note that it satisfies ‖J̄ − J_μ̄‖_∞ ≤ δ (by our assumption on the accuracy of policy evaluation) and TJ̄ = T_μ̄ J̄ (since μ^k terminates at μ̄). We write

$$T J_{\bar\mu} \ge T(\bar J - \delta e) = T\bar J - \alpha\delta e = T_{\bar\mu}\bar J - \alpha\delta e \ge T_{\bar\mu}(J_{\bar\mu} - \delta e) - \alpha\delta e = T_{\bar\mu}J_{\bar\mu} - 2\alpha\delta e,$$

and since T_μ̄ J_μ̄ = J_μ̄, we obtain TJ_μ̄ ≥ J_μ̄ − 2αδe. From this, by applying T to both sides, we obtain

$$T^2 J_{\bar\mu} \ge T J_{\bar\mu} - 2\alpha^2\delta e \ge J_{\bar\mu} - 2\alpha\delta(1+\alpha)e,$$

and by similar continued application of T to both sides,

$$J^* = \lim_{m\to\infty} T^m J_{\bar\mu} \ge J_{\bar\mu} - \frac{2\alpha\delta}{1-\alpha}\,e,$$

thereby showing the error bound (6.134).
In view of the preceding discussion, it is interesting to investigate conditions under which we have convergence of policies. From the mathematical point of view, it turns out that policy oscillation is caused by the lack of monotonicity and the dependence (through ξ) on μ of the projection operator. With this in mind, we will replace Π with a constant operator W that has a monotonicity property. Moreover, it is simpler both conceptually and notationally to do this in a broader and more abstract setting that transcends discounted DP problems, thereby obtaining a more general approximate policy iteration algorithm.

To this end, consider a method involving a (possibly nonlinear) mapping H_μ : ℝ^n → ℝ^n, parametrized by the policy μ, and the mapping H : ℝ^n → ℝ^n, defined by

$$HJ = \min_{\mu\in M} H_\mu J,\tag{6.135}$$

where M is a finite subset of policies, and the minimization above is done separately for each component of H_μ J, i.e.,

$$(HJ)(i) = \min_{\mu\in M}(H_\mu J)(i),\qquad i = 1,\ldots,n.$$

Abstract mappings of this type and their relation to DP have been studied in Denardo [Den67], Bertsekas [Ber77], and Bertsekas and Shreve [BeS78]. The discounted DP case corresponds to H_μ = T_μ and H = T. Another special case is a mapping H_μ that is similar to T_μ but arises in discounted semi-Markov problems. Nonlinear mappings H_μ also arise in the context of minimax DP problems and sequential games; see Shapley [Sha53], and [Den67], [Ber77], [BeS78].
We will construct a policy iteration method that aims to find an approximation to a fixed point of H, and evaluates a policy μ ∈ M with a solution J̃_μ of the following fixed point equation in the vector J:

$$(WH_\mu)(J) = J,\tag{6.136}$$

where W : ℝ^n → ℝ^n is a mapping (possibly nonlinear, but independent of μ). Policy evaluation by solving the projected equation corresponds to W = Π. Rather than specify properties of H_μ under which H has a unique fixed point (as in [Den67], [Ber77], and [BeS78]), it is simpler for our purposes to introduce corresponding assumptions on the mappings W and WH_μ. In particular, we assume the following:

(a) For each J, the minimum in Eq. (6.135) is attained, in the sense that there exists μ̄ ∈ M such that HJ = H_μ̄ J.

(b) For each μ ∈ M, the mappings W and WH_μ are monotone in the sense that

$$WJ \le W\bar J,\qquad (WH_\mu)(J) \le (WH_\mu)(\bar J),\qquad \forall\ J,\bar J\in\mathbb{R}^n \text{ with } J \le \bar J.\tag{6.137}$$

(c) For each μ, the solution J̃_μ of Eq. (6.136) is unique, and for all J such that (WH_μ)(J) ≤ J, we have

$$\tilde J_\mu = \lim_{k\to\infty}(WH_\mu)^k(J).$$

Based on condition (a), we introduce a policy improvement operation that is similar to the case where H_μ = T_μ, i.e., the improved policy μ̄ satisfies H_μ̄ J̃_μ = H J̃_μ. Note that condition (c) is satisfied if WH_μ is a contraction, while condition (b) is satisfied if W is a matrix with nonnegative components and H_μ is monotone for all μ.
Proposition 6.3.8: Let the preceding conditions (a)-(c) hold. Consider the policy iteration method that uses the fixed point J̃_μ of the mapping WH_μ for evaluation of the policy μ [cf. Eq. (6.136)], and the equation H_μ̄ J̃_μ = H J̃_μ for policy improvement. Assume that the method is initiated with some policy in M, and it is operated so that it terminates when a policy μ̄ is obtained such that H_μ̄ J̃_μ̄ = H J̃_μ̄. Then the method terminates in a finite number of iterations, and the vector J̃_μ̄ obtained upon termination is a fixed point of WH.

Proof: Similar to the standard proof of convergence of (exact) policy iteration, we use the policy improvement equation H_μ̄ J̃_μ = H J̃_μ, the monotonicity of W, and the policy evaluation Eq. (6.136) to write

$$(WH_{\bar\mu})(\tilde J_\mu) = (WH)(\tilde J_\mu) \le (WH_\mu)(\tilde J_\mu) = \tilde J_\mu.$$

By iterating with the monotone mapping WH_μ̄ and by using condition (c), we obtain

$$\tilde J_{\bar\mu} = \lim_{k\to\infty}(WH_{\bar\mu})^k(\tilde J_\mu) \le \tilde J_\mu.$$

There are finitely many policies, so we must have J̃_μ̄ = J̃_μ after a finite number of iterations, which using the policy improvement equation H_μ̄ J̃_μ = H J̃_μ, implies that H_μ̄ J̃_μ̄ = H J̃_μ̄. Thus the algorithm terminates with μ̄, and since J̃_μ̄ = (WH_μ̄)(J̃_μ̄), it follows that J̃_μ̄ is a fixed point of WH. Q.E.D.
An important special case where Prop. 6.3.8 applies and policies converge is when H_μ = T_μ, H = T, W is linear of the form W = ΦD, where Φ and D are n × s and s × n matrices, respectively, whose rows are probability distributions, and the policy evaluation uses the linear feature-based approximation J̃_μ = Φr_μ. This is the case of policy evaluation by aggregation, which will be discussed in Section 6.4. Then it can be seen that W is monotone and that WT_μ is a sup-norm contraction (since W is nonexpansive with respect to the sup-norm), so that conditions (a)-(c) are satisfied.

Policy convergence as per Prop. 6.3.8 is also attained in the more general case where W = ΦD, with the matrix W having nonnegative components, and row sums that are less than or equal to 1, i.e.,

$$\sum_{m=1}^s \Phi_{im} D_{mj} \ge 0,\qquad i,j = 1,\ldots,n,\tag{6.138}$$

$$\sum_{m=1}^s \Phi_{im} \sum_{j=1}^n D_{mj} \le 1,\qquad i = 1,\ldots,n.\tag{6.139}$$

If Φ and D have nonnegative components, Eq. (6.138) is automatically satisfied, while Eq. (6.139) is equivalent to the set of n linear inequalities

$$\phi(i)'\zeta \le 1,\qquad i = 1,\ldots,n,\tag{6.140}$$

where φ(i)′ is the ith row of Φ, and ζ ∈ ℝ^s is the column vector of row sums of D, i.e., the one that has components

$$\zeta(m) = \sum_{j=1}^n D_{mj},\qquad m = 1,\ldots,s.$$

(Footnote) A column of Φ that has both positive and negative components may be replaced with the two columns that contain its positive and the opposite of its negative components. This will create a new nonnegative matrix Φ with as many as twice the number of columns, and will also enlarge the approximation subspace S (leading to no worse approximation). Then the matrix D may be optimized subject to D ≥ 0 and the constraints (6.140), with respect to some performance criterion. Given a choice of Φ ≥ 0, an interesting question is how to construct effective algorithms for parametric optimization of a nonnegative matrix W = ΦD, subject to the constraints (6.138)-(6.139). One possibility is to use D = βMΦ′Ξ, W = βΦMΦ′Ξ, where M is a positive definite diagonal replacement/approximation of (Φ′ΞΦ)^{-1} in the projection formula Π = Φ(Φ′ΞΦ)^{-1}Φ′Ξ, and β > 0 is a scalar parameter that is adjusted to ensure that condition (c) of Prop. 6.3.8 is satisfied. Note that Φ′ΞΦ may be easily computed by simulation, but since W should be independent of μ, the same should be true for Ξ.
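As a small illustration, the following Python sketch (with a toy Φ and D chosen here purely for illustration) checks the nonnegativity and row-sum conditions (6.138)-(6.139) for W = ΦD:

    import numpy as np

    def check_monotone_nonexpansive(Phi, D, tol=1e-12):
        """Check conditions (6.138)-(6.139): W = Phi D has nonnegative entries
        and row sums of at most 1 (monotonicity and sup-norm nonexpansiveness)."""
        W = Phi @ D
        return bool(np.all(W >= -tol)), bool(np.all(W.sum(axis=1) <= 1 + tol))

    # A toy nonnegative Phi and row-stochastic D, for illustration only.
    Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
    D = np.array([[0.7, 0.3, 0.0], [0.0, 0.2, 0.8]])
    print(check_monotone_nonexpansive(Phi, D))   # (True, True)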
Even in this more general case, the policy evaluation Eq. (6.136) can be solved by using simulation and low order calculations (see Section 6.8).

A special case arises when through a reordering of indexes, the matrix D can be partitioned in the form D = (Δ 0), where Δ is a positive definite diagonal matrix with diagonal elements δ_m, m = 1, ..., s, satisfying

$$\sum_{m=1}^s \Phi_{im}\,\delta_m \le 1,\qquad i = 1,\ldots,n.$$

An example of a structure of this type arises in coarse grid discretization/aggregation schemes (Section 6.4).

When the projected equation approach is used (W = Π where Π is a projection matrix) and the mapping H_μ is monotone in the sense that H_μ J ≤ H_μ J̄ for all J, J̄ ∈ ℝ^n with J ≤ J̄ (as it typically is in DP models), then the monotonicity assumption (6.137) is satisfied if Π is independent of the policy μ and satisfies ΠJ ≥ 0 for all J with J ≥ 0. This is true in particular when both Φ and (Φ′ΞΦ)^{-1}Φ′Ξ have nonnegative components, in view of the projection formula Π = Φ(Φ′ΞΦ)^{-1}Φ′Ξ. A special case is when Φ is nonnegative and has linearly independent columns that are orthogonal with respect to the inner product ⟨x₁, x₂⟩ = x₁′Ξx₂, in which case Φ′ΞΦ is positive definite and diagonal.

An example of the latter case is hard aggregation, where the state space {1, ..., n} is partitioned in s nonempty subsets I₁, ..., I_s and (cf. Section 6.4, and Vol. I, Section 6.3.4):

(1) The ℓth column of Φ has components that are 1 or 0 depending on whether they correspond to an index in I_ℓ or not.

(2) The ℓth row of D is a probability distribution (d_{ℓ1}, ..., d_{ℓn}) whose components are positive depending on whether they correspond to an index in I_ℓ or not, i.e., ∑_{j=1}^n d_{ℓj} = 1, d_{ℓj} > 0 if j ∈ I_ℓ, and d_{ℓj} = 0 if j ∉ I_ℓ.

With these definitions of Φ and D, it can be verified that W is given by the projection formula

$$W = \Phi D = \Pi = \Phi(\Phi'\Xi\Phi)^{-1}\Phi'\Xi,$$

where Ξ is the diagonal matrix with the nonzero components of D along the diagonal. In fact Π can be written in the explicit form

$$(\Pi J)(i) = \sum_{j\in I_\ell} d_{\ell j}\,J(j),\qquad i\in I_\ell,\ \ell = 1,\ldots,s.$$

Thus Φ and Π have nonnegative components and, assuming that D (and hence Ξ) is held constant, policy iteration converges.
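The following Python sketch illustrates this hard aggregation case on a small hypothetical partition: it builds Φ and D as above, and verifies numerically that W = ΦD coincides with the weighted projection Π = Φ(Φ′ΞΦ)^{-1}Φ′Ξ when Ξ carries the nonzero components of D:

    import numpy as np

    # Hard aggregation on 5 states with subsets I_1 = {0, 1}, I_2 = {2, 3, 4}
    # (a hypothetical partition used only to illustrate the identity W = Pi).
    subsets = [[0, 1], [2, 3, 4]]
    n, s = 5, len(subsets)

    Phi = np.zeros((n, s))
    D = np.zeros((s, n))
    for ell, I in enumerate(subsets):
        Phi[I, ell] = 1.0               # 0/1 membership columns of Phi
        D[ell, I] = 1.0 / len(I)        # a probability distribution over I_ell

    W = Phi @ D

    # Xi holds the nonzero components of D along its diagonal (cf. the text).
    xi = D.sum(axis=0)
    Pi = Phi @ np.linalg.inv(Phi.T @ np.diag(xi) @ Phi) @ Phi.T @ np.diag(xi)

    print(np.allclose(W, Pi))           # True: W = Phi D is the weighted projection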
6.3.9 λ-Policy Iteration

In this section we return to the idea of optimistic policy iteration and we discuss an alternative method, which connects with TD(λ) and with the multistep λ-methods. We first consider the case of a lookup table representation, and we discuss later the case where we use cost function approximation.

We view optimistic policy iteration as a process that generates a sequence of cost vector-policy pairs {(J_k, μ^k)}, starting with some (J_0, μ^0). At iteration k, we generate an improved policy μ^{k+1} satisfying

$$T_{\mu^{k+1}} J_k = T J_k,\tag{6.141}$$

and we then compute J_{k+1} as an approximate evaluation of the cost vector J_{μ^{k+1}} of μ^{k+1}. In the optimistic policy iteration method that we have discussed so far, J_{k+1} is obtained by several, say m_k, value iterations using μ^{k+1}:

$$J_{k+1} = T_{\mu^{k+1}}^{m_k} J_k.\tag{6.142}$$

We now introduce another method whereby J_{k+1} is instead obtained by a single value iteration using the mapping T_{μ^{k+1}}^{(λ)}:

$$J_{k+1} = T_{\mu^{k+1}}^{(\lambda)} J_k,\tag{6.143}$$

where for any μ and λ ∈ (0, 1),

$$T_\mu^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^\infty \lambda^\ell\, T_\mu^{\ell+1}.$$

This is the mapping encountered in Section 6.3.6 [cf. Eqs. (6.83)-(6.84)]:

$$T_\mu^{(\lambda)} J = g_\mu^{(\lambda)} + P_\mu^{(\lambda)} J,\tag{6.144}$$

where

$$P_\mu^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\, P_\mu^{\ell+1},\qquad g_\mu^{(\lambda)} = \sum_{\ell=0}^\infty \alpha^\ell\lambda^\ell\, P_\mu^\ell\, g_\mu = (I - \alpha\lambda P_\mu)^{-1} g_\mu.\tag{6.145}$$
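As an illustration of the algorithm of Eqs. (6.141), (6.143), the following Python sketch performs lookup table λ-policy iteration using the closed forms (6.144)-(6.145); the inputs P and g are hypothetical (one transition matrix and one expected one-stage cost vector per control, with the same control set at every state):

    import numpy as np

    def lambda_policy_iteration(P, g, alpha, lam, n_iter=50):
        """Lookup-table lambda-policy iteration, Eqs. (6.141), (6.143)-(6.145).
        P[u] is the n x n transition matrix and g[u] the expected cost vector
        of control u (assumed common control set at every state)."""
        n = P[0].shape[0]
        J = np.zeros(n)
        for _ in range(n_iter):
            # Policy improvement (6.141): greedy policy with respect to J.
            Q = np.array([g[u] + alpha * P[u] @ J for u in range(len(P))])
            mu = Q.argmin(axis=0)
            P_mu = np.array([P[mu[i]][i] for i in range(n)])
            g_mu = np.array([g[mu[i]][i] for i in range(n)])
            # Single multistep value iteration (6.143): J <- T_mu^(lambda) J,
            # using the closed forms (6.144)-(6.145).
            inv = np.linalg.inv(np.eye(n) - alpha * lam * P_mu)
            g_lam = inv @ g_mu
            P_lam = (1.0 - lam) * inv @ P_mu
            J = g_lam + alpha * P_lam @ J
        return J, mu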
We call the method of Eqs. (6.141) and (6.143) λ-policy iteration, and we will show shortly that its properties are similar to the ones of the standard method of Eqs. (6.141), (6.142).

Indeed, both mappings T_{μ^{k+1}}^{m_k} and T_{μ^{k+1}}^{(λ)} appearing in Eqs. (6.142) and (6.143), involve multiple applications of the value iteration mapping T_{μ^{k+1}}: a fixed number m_k in the former case (with m_k = 1 corresponding to value iteration and m_k → ∞ corresponding to policy iteration), and an exponentially weighted number in the latter case (with λ = 0 corresponding to value iteration and λ → 1 corresponding to policy iteration). Thus optimistic policy iteration and λ-policy iteration are similar: they just control the accuracy of the approximation J_{k+1} ≈ J_{μ^{k+1}} by applying value iterations in different ways.

The following proposition provides some basic properties of λ-policy iteration.

Proposition 6.3.9: Given λ ∈ [0, 1), J_k, and μ^{k+1}, consider the mapping W_k defined by

$$W_k J = (1-\lambda)T_{\mu^{k+1}} J_k + \lambda T_{\mu^{k+1}} J.\tag{6.146}$$

(a) The mapping W_k is a sup-norm contraction of modulus αλ.

(b) The vector J_{k+1} generated next by the λ-policy iteration method of Eqs. (6.141), (6.143) is the unique fixed point of W_k.

Proof: (a) For any two vectors J and J̄, using the definition (6.146) of W_k, we have

$$\|W_k J - W_k \bar J\| = \bigl\|\lambda(T_{\mu^{k+1}}J - T_{\mu^{k+1}}\bar J)\bigr\| = \lambda\|T_{\mu^{k+1}}J - T_{\mu^{k+1}}\bar J\| \le \alpha\lambda\|J - \bar J\|,$$

where ‖·‖ denotes the sup-norm, so W_k is a sup-norm contraction with modulus αλ.

(b) We have

$$J_{k+1} = T_{\mu^{k+1}}^{(\lambda)} J_k = (1-\lambda)\sum_{\ell=0}^\infty \lambda^\ell\, T_{\mu^{k+1}}^{\ell+1} J_k,$$

so the fixed point property to be shown, J_{k+1} = W_k J_{k+1}, is written as

$$(1-\lambda)\sum_{\ell=0}^\infty \lambda^\ell\, T_{\mu^{k+1}}^{\ell+1} J_k = (1-\lambda)T_{\mu^{k+1}}J_k + \lambda T_{\mu^{k+1}}\Bigl((1-\lambda)\sum_{\ell=0}^\infty \lambda^\ell\, T_{\mu^{k+1}}^{\ell+1} J_k\Bigr),$$

and evidently holds. Q.E.D.
From part (b) of the preceding proposition, we see that the equation defining J_{k+1} is

$$J_{k+1}(i) = \sum_{j=1}^n p_{ij}\bigl(\mu^{k+1}(i)\bigr)\Bigl(g\bigl(i,\mu^{k+1}(i),j\bigr) + \alpha(1-\lambda)J_k(j) + \alpha\lambda J_{k+1}(j)\Bigr).\tag{6.147}$$

The solution of this equation can be obtained by viewing it as Bellman's equation for two equivalent MDP.

(a) As Bellman's equation for an infinite-horizon αλ-discounted MDP where μ^{k+1} is the only policy, and the cost per stage is

$$g\bigl(i,\mu^{k+1}(i),j\bigr) + \alpha(1-\lambda)J_k(j).$$

(b) As Bellman's equation for an infinite-horizon optimal stopping problem where μ^{k+1} is the only policy. In particular, J_{k+1} is the cost vector of policy μ^{k+1} in an optimal stopping problem that is derived from the given discounted problem by introducing transitions from each state j to an artificial termination state. More specifically, in this stopping problem, transitions and costs occur as follows: at state i we first make a transition to j with probability p_{ij}(μ^{k+1}(i)); then we either stay in j and wait for the next transition (this occurs with probability λ), or else we move from j to the termination state with an additional termination cost J_k(j) (this occurs with probability 1 − λ).

Note that the two MDP described above are potentially much easier than the original, because they involve a smaller effective discount factor (αλ versus α). The two interpretations of λ-policy iteration in terms of these MDP provide options for approximate simulation-based solution using cost function approximation, which we discuss next. The approximate solution can be obtained by using the projected equation approach of this section, or another methodology such as the aggregation approach of Section 6.4. Moreover the solution may itself be approximated with a finite number of value iterations, i.e., the algorithm

$$J_{k+1} = W_k^{m_k} J_k,\qquad T_{\mu^{k+1}} J_k = T J_k,\tag{6.148}$$

in place of Eqs. (6.141), (6.143), where W_k is the mapping (6.146) and m_k > 1 is an integer. These value iterations may converge fast because they involve the smaller effective discount factor αλ.
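A minimal sketch of this idea for the lookup table case: J_{k+1} can be computed by iterating the contraction W_k of Eq. (6.146), whose modulus is αλ rather than α, so relatively few iterations are needed when λ is not close to 1 [cf. the optimistic variant (6.148)]:

    import numpy as np

    def solve_lambda_evaluation(P_mu, g_mu, J_k, alpha, lam, m=20):
        """Compute J_{k+1} of Eq. (6.147) by iterating the contraction W_k of
        Eq. (6.146): J <- (1 - lam)*T_mu(J_k) + lam*T_mu(J)."""
        T_mu = lambda J: g_mu + alpha * P_mu @ J
        const = (1.0 - lam) * T_mu(J_k)    # part of W_k that does not depend on J
        J = J_k.copy()
        for _ in range(m):                 # modulus alpha*lam, so m can be modest
            J = const + lam * T_mu(J)
        return J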
Implementations of λ-Policy Iteration

We will now discuss three alternative simulation-based implementations with cost function approximation J ≈ Φr and projection. The first implementation is based on the formula J_{k+1} = T_{μ^{k+1}}^{(λ)} J_k. This is just a single iteration of PVI(λ) for evaluating J_{μ^{k+1}}, and can be approximated by a single iteration of LSPE(λ):

$$\Phi r_{k+1} = \Pi T_{\mu^{k+1}}^{(\lambda)}(\Phi r_k).$$

It can be implemented in the manner discussed in Section 6.3.6, with a simulation trajectory generated by using μ^{k+1}.

The second implementation is based on a property mentioned earlier: Eq. (6.147) is Bellman's equation for the policy μ^{k+1} in the context of an optimal stopping problem. Thus to compute a function approximation to J_{k+1}, we may find by simulation an approximate solution of this equation, by using a function approximation to J_k and the appropriate cost function approximation methods for stopping problems. We will discuss methods of this type in Section 6.6, including analogs of LSTD(λ), LSPE(λ), and TD(λ). We will see that these methods are often able to deal much more comfortably with the issue of exploration, and do not require elaborate modifications of the type discussed in Section 6.4.1; in particular they involve a relatively short trajectory from any initial state to the termination state, followed by restart from a randomly chosen state (see the comments at the end of Section 6.6). Here the termination probability at each state is 1 − λ, so for λ not very close to 1, the simulation trajectories are short. When the details of this implementation are fleshed out, we obtain the exploration-enhanced version of LSPE(λ) described in Section 6.3.6 (see [Ber11b]).

The third implementation, suggested and tested by Thiery and Scherrer [ThS10a], is based on the fixed point property of J_{k+1} [cf. Prop. 6.3.9(b)], and uses the projected version of the equation W_k J = (1 − λ)T_{μ^{k+1}} J_k + λT_{μ^{k+1}} J [cf. Eq. (6.146)]

$$\Phi r = \Pi\Bigl((1-\lambda)T_{\mu^{k+1}}(\Phi r_k) + \lambda T_{\mu^{k+1}}(\Phi r)\Bigr),\tag{6.149}$$

or equivalently

$$\Phi r = \Pi\Bigl(g_{\mu^{k+1}} + \alpha(1-\lambda)P_{\mu^{k+1}}\Phi r_k + \alpha\lambda P_{\mu^{k+1}}\Phi r\Bigr).$$

It can be seen that this is a projected equation in r, similar to the one discussed in Section 6.3.1 [cf. Eq. (6.37)]. In particular, the solution r_{k+1} solves the orthogonality equation [cf. Eq. (6.39)]

$$C r = d(k),$$

where

$$C = \Phi'\Xi(I - \alpha\lambda P_{\mu^{k+1}})\Phi,\qquad d(k) = \Phi'\Xi\bigl(g_{\mu^{k+1}} + \alpha(1-\lambda)P_{\mu^{k+1}}\Phi r_k\bigr),$$

so that

$$r_{k+1} = C^{-1} d(k).$$
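For concreteness, here is a sketch of one parameter update of this third implementation using the exact model quantities (a simulation-based version would replace C and d(k) with sample-based estimates, as discussed next); Φ, Ξ, P_μ, and g_μ are assumed given:

    import numpy as np

    def third_implementation_step(Phi, Xi, P_mu, g_mu, r_k, alpha, lam):
        """Solve the projected equation (6.149) for the next parameter vector:
        C r = d(k), with C = Phi' Xi (I - alpha*lam*P_mu) Phi and
        d(k) = Phi' Xi (g_mu + alpha*(1 - lam)*P_mu Phi r_k)."""
        n = P_mu.shape[0]
        C = Phi.T @ Xi @ (np.eye(n) - alpha * lam * P_mu) @ Phi
        d = Phi.T @ Xi @ (g_mu + alpha * (1.0 - lam) * P_mu @ (Phi @ r_k))
        return np.linalg.solve(C, d)       # r_{k+1} = C^{-1} d(k)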
In a simulation-based implementation, the matrix C and the vector d(k) are approximated similar to LSTD(0). However, r_{k+1} as obtained by this method, aims to approximate r̄_0, the limit of TD(0), not r̄_λ, the limit of TD(λ). To see this, suppose that this iteration is repeated an infinite number of times so it converges to a limit r̄. Then from Eq. (6.149), we have

$$\Phi\bar r = \Pi\Bigl((1-\lambda)T_{\mu^{k+1}}(\Phi\bar r) + \lambda T_{\mu^{k+1}}(\Phi\bar r)\Bigr),$$

which shows that Φr̄ = ΠT_{μ^{k+1}}(Φr̄), so r̄ = r̄_0. Indeed the approximation via projection in this implementation is somewhat inconsistent: it is designed so that Φr_{k+1} is an approximation to T_{μ^{k+1}}^{(λ)}(Φr_k), yet as λ → 1, from Eq. (6.149) we see that r_{k+1} → r̄_0, not r̄_λ. Thus it would appear that while this implementation deals well with the issue of exploration, it may not deal well with the issue of bias. For further discussion, we refer to Bertsekas [Ber11b].
Convergence and Convergence Rate of λ-Policy Iteration

The following proposition shows the validity of the λ-policy iteration method and provides its convergence rate for the case of a lookup table representation. A similar result holds for the optimistic version (6.148).

Proposition 6.3.10: (Convergence for Lookup Table Case) Assume that λ ∈ [0, 1), and let {(J_k, μ^k)} be the sequence generated by the λ-policy iteration algorithm of Eqs. (6.141), (6.143). Then J_k converges to J*. Furthermore, for all k greater than some index k̄, we have

$$\|J_{k+1} - J^*\| \le \frac{\alpha(1-\lambda)}{1-\alpha\lambda}\,\|J_k - J^*\|,\tag{6.150}$$

where ‖·‖ denotes the sup-norm.
Proof: Let us first assume that TJ_0 ≤ J_0. We show by induction that for all k, we have

$$J^* \le TJ_{k+1} \le J_{k+1} \le TJ_k \le J_k.\tag{6.151}$$

To this end, we fix k and we assume that TJ_k ≤ J_k. We will show that J* ≤ TJ_{k+1} ≤ J_{k+1} ≤ TJ_k, and then Eq. (6.151) will follow from the hypothesis TJ_0 ≤ J_0.

Using the fact T_{μ^{k+1}} J_k = TJ_k and the definition of W_k [cf. Eq. (6.146)], we have

$$W_k J_k = T_{\mu^{k+1}} J_k = TJ_k \le J_k.$$

It follows from the monotonicity of T_{μ^{k+1}}, which implies monotonicity of W_k, that for all positive integers ℓ, we have W_k^{ℓ+1} J_k ≤ W_k^ℓ J_k ≤ TJ_k ≤ J_k, so by taking the limit as ℓ → ∞, we obtain

$$J_{k+1} \le TJ_k \le J_k.\tag{6.152}$$

From the definition of W_k, we have

$$W_k J_{k+1} = T_{\mu^{k+1}} J_k + \lambda\bigl(T_{\mu^{k+1}} J_{k+1} - T_{\mu^{k+1}} J_k\bigr) = T_{\mu^{k+1}} J_{k+1} + (1-\lambda)\bigl(T_{\mu^{k+1}} J_k - T_{\mu^{k+1}} J_{k+1}\bigr).$$

Using the already shown relation J_k − J_{k+1} ≥ 0 and the monotonicity of T_{μ^{k+1}}, we obtain T_{μ^{k+1}} J_k − T_{μ^{k+1}} J_{k+1} ≥ 0, so that

$$T_{\mu^{k+1}} J_{k+1} \le W_k J_{k+1}.$$

Since W_k J_{k+1} = J_{k+1}, it follows that

$$TJ_{k+1} \le T_{\mu^{k+1}} J_{k+1} \le J_{k+1}.\tag{6.153}$$

Finally, the above relation and the monotonicity of T_{μ^{k+1}} imply that for all positive integers ℓ, we have T_{μ^{k+1}}^ℓ J_{k+1} ≤ T_{μ^{k+1}} J_{k+1}, so by taking the limit as ℓ → ∞, we obtain

$$J_{\mu^{k+1}} \le T_{\mu^{k+1}} J_{k+1}.\tag{6.154}$$

From Eqs. (6.152)-(6.154), we see that the inductive proof of Eq. (6.151) is complete.

From Eq. (6.151), it follows that the sequence {J_k} converges to some limit J̄ with J* ≤ J̄. Using the definition (6.146) of W_k, and the facts J_{k+1} = W_k J_{k+1} and T_{μ^{k+1}} J_k = TJ_k, we have

$$J_{k+1} = W_k J_{k+1} = TJ_k + \lambda\bigl(T_{\mu^{k+1}} J_{k+1} - T_{\mu^{k+1}} J_k\bigr),$$

so by taking the limit as k → ∞ and by using the fact J_{k+1} − J_k → 0, we obtain J̄ = TJ̄. Thus J̄ is a solution of Bellman's equation, and it follows that J̄ = J*.

To show the result without the assumption TJ_0 ≤ J_0, note that we can replace J_0 by a vector Ĵ_0 = J_0 + se, where e = (1, ..., 1) and s is a scalar that is sufficiently large so that we have TĴ_0 ≤ Ĵ_0; it can be seen that for any scalar s ≥ (1 − α)^{-1} max_i (TJ_0(i) − J_0(i)), the relation TĴ_0 ≤ Ĵ_0 holds. Consider the λ-policy iteration algorithm started with (Ĵ_0, μ^0), and let {(Ĵ_k, μ̂^k)} be the generated sequence. Then it can be verified by induction that for all k we have

$$\hat J_k - J_k = \Bigl(\frac{\alpha(1-\lambda)}{1-\alpha\lambda}\Bigr)^k s\,e,\qquad \hat\mu^k = \mu^k.$$

Hence Ĵ_k − J_k → 0. Since we have already shown that Ĵ_k → J*, it follows that J_k → J* as well.

Since J_k → J*, it follows that for all k larger than some index k̄, μ^{k+1} is an optimal policy, so that T_{μ^{k+1}} J* = TJ* = J*. By using this fact and Prop. 6.3.5, we obtain for all k ≥ k̄,

$$\|J_{k+1} - J^*\| = \bigl\|T_{\mu^{k+1}}^{(\lambda)} J_k - T_{\mu^{k+1}}^{(\lambda)} J^*\bigr\| \le \frac{\alpha(1-\lambda)}{1-\alpha\lambda}\,\|J_k - J^*\|.$$

Q.E.D.
For the case of cost function approximation, we have the following error bound, which resembles the one for approximate policy iteration (Prop. 1.3.6 in Chapter 1).

Proposition 6.3.11: (Error Bound for Cost Function Approximation Case) Let {λ_ℓ} be a sequence of nonnegative scalars with ∑_{ℓ=0}^∞ λ_ℓ = 1. Consider an algorithm that obtains a sequence of cost vector-policy pairs {(J_k, μ^k)}, starting with some (J_0, μ^0), as follows: at iteration k, it generates an improved policy μ^{k+1} satisfying

$$T_{\mu^{k+1}} J_k = TJ_k,$$

and then it computes J_{k+1} by some method that satisfies

$$\Bigl\|J_{k+1} - \sum_{\ell=0}^\infty \lambda_\ell\, T_{\mu^{k+1}}^{\ell+1} J_k\Bigr\| \le \delta,$$

where δ is some scalar. Then we have

$$\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \le \frac{2\alpha\delta}{(1-\alpha)^2}.$$

For the proof of the proposition, we refer to Thiery and Scherrer [ThS10b]. Note that the proposition applies to both the standard optimistic policy iteration method (λ_ℓ = 1 for a single value of ℓ and λ_ℓ = 0 for all other values), and the λ-policy iteration method [λ_ℓ = (1 − λ)λ^ℓ].
6.3.10 A Synopsis

Several algorithms for approximate evaluation of the cost vector J_μ of a single stationary policy μ in finite-state discounted problems have been given so far, and we will now summarize the analysis. We will also explain what can go wrong when the assumptions of this analysis are violated. We have focused on two types of algorithms:

(1) Direct methods, such as the batch and incremental gradient methods of Section 6.2, including TD(1). These methods allow for a nonlinear approximation architecture, and for a lot of flexibility in the collection of the cost samples that are used in the least squares optimization. For example, in direct methods, issues of exploration do not interfere with issues of convergence. The drawbacks of direct methods are that they are not well-suited for problems with large variance of simulation noise, and they can also be very slow when implemented using gradient-like methods. The former difficulty is in part due to the lack of the parameter λ, which is used in other methods to reduce the variance/noise in the parameter update formulas.

(2) Indirect methods that are based on solution of a projected version of Bellman's equation. These are simulation-based methods that include approximate matrix inversion methods such as LSTD(λ), and iterative methods such as LSPE(λ) and its scaled versions, and TD(λ) (Sections 6.3.1-6.3.6).
The salient characteristics of our analysis of indirect methods are:
(a) For a given choice of λ ∈ [0, 1), all indirect methods aim to compute r̄_λ, the unique solution of the projected Bellman equation Φr = ΠT^{(λ)}(Φr). This equation is linear, of the form C^{(λ)} r = d^{(λ)}, and expresses the orthogonality of the vector Φr − T^{(λ)}(Φr) and the approximation subspace S.
(b) We may use simulation and low-order matrix-vector calculations to
approximate C
()
and d
()
with a matrix C
()
k
and vector d
()
k
, re-
spectively. The simulation may be supplemented with exploration
enhancement, which suitably changes the projection norm to ensure
adequate weighting of all states in the cost approximation. This is
important in the context of policy iteration, as discussed in Section
6.3.7.
(c) The approximations C_k^{(λ)} and d_k^{(λ)} can be used in both types of methods: matrix inversion and iterative. The principal example of a matrix inversion method is LSTD(λ), which simply computes the solution

$$\hat r_k = \bigl(C_k^{(\lambda)}\bigr)^{-1} d_k^{(\lambda)}\tag{6.155}$$

of the simulation-based approximation C_k^{(λ)} r = d_k^{(λ)} of the projected equation. Principal examples of iterative methods are LSPE(λ) and its scaled versions,

$$r_{k+1} = r_k - \gamma G_k\bigl(C_k^{(\lambda)} r_k - d_k^{(\lambda)}\bigr).\tag{6.156}$$
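As a rough sketch of how C_k^{(λ)}, d_k^{(λ)} and the two updates (6.155)-(6.156) might be formed from a single simulated trajectory (the trajectory data phi and g are hypothetical inputs, and the choice G_k = C_k^{-1} with a unit stepsize is used purely for illustration):

    import numpy as np

    def lstd_lspe_lambda(phi, g, alpha, lam, r0, gamma=1.0):
        """Build sample-based estimates of C^(lambda), d^(lambda) from one
        trajectory and return the LSTD(lambda) solution (6.155) together with
        one LSPE(lambda)-type update (6.156), here with G_k = C_k^{-1}.
        phi[t] is the feature row of the state visited at time t (one more row
        than there are transitions); g[t] is the observed cost of transition t."""
        s = phi.shape[1]
        C, d, z = np.zeros((s, s)), np.zeros(s), np.zeros(s)
        for t in range(len(g)):
            z = alpha * lam * z + phi[t]                      # eligibility vector
            C += np.outer(z, phi[t] - alpha * phi[t + 1])
            d += z * g[t]
        C /= len(g)
        d /= len(g)
        r_lstd = np.linalg.solve(C, d)                        # Eq. (6.155)
        r_lspe = r0 - gamma * np.linalg.solve(C, C @ r0 - d)  # Eq. (6.156), G_k = C_k^{-1}
        return r_lstd, r_lspe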
TD(λ) is another major iterative method. It differs in an important way from LSPE(λ), namely it uses single-sample approximations of C^{(λ)} and d^{(λ)}, which are much less accurate than C_k^{(λ)} and d_k^{(λ)}, and as a result it requires a diminishing stepsize to deal with the associated noise. A key property for convergence to r̄_λ of TD(λ) and the unscaled form of LSPE(λ) (without exploration enhancement) is that T^{(λ)} is a contraction with respect to the projection norm ‖·‖_ξ, which implies that ΠT^{(λ)} is also a contraction with respect to the same norm.
(d) LSTD(λ) and LSPE(λ) are connected through the regularized regression-based form (6.90), which aims to deal effectively with cases where C_k^{(λ)} is nearly singular and/or involves large simulation error (see Section 6.3.4). This is the special case of the LSPE(λ) class of methods, corresponding to the special choice (6.89) of G_k. The LSTD(λ) method of Eq. (6.155), and the entire class of LSPE(λ)-type iterations (6.156) converge at the same asymptotic rate, in the sense that

$$\|\hat r_k - r_k\| << \|\hat r_k - \bar r_\lambda\|$$

for large k. However, depending on the choice of G_k, the short-term behavior of the LSPE-type methods is more regular as it involves implicit regularization through dependence on the initial condition. This behavior may be an advantage in the policy iteration context where optimistic variants, involving more noisy iterations, are used.
(e) When the LSTD() and LSPE() methods are exploration-enhanced
for the purpose of embedding within an approximate policy itera-
tion framework, their convergence properties become more compli-
cated: LSTD() and the regularized regression-based version (6.90) of
LSPE() converge to the solution of the corresponding (exploration-
enhanced) projected equation for an arbitrary amount of exploration,
but TD() and other special cases of LSPE() do so only for a limited
amount of exploration and/or for suciently close to 1, as discussed
in Section 6.3.7. On the other hand there is a special least-squares
based exploration-enhanced version of LSPE() that overcomes this
diculty (cf. Section 6.3.6).
(f) The limit r̄_λ depends on λ. The estimate of Prop. 6.3.5 indicates that the approximation error ‖J_μ − Φr̄_λ‖_ξ increases as the distance ‖J_μ − ΠJ_μ‖_ξ from the subspace S becomes larger, and also increases as λ becomes smaller. Indeed, the error degradation may be very significant for small values of λ, as shown by an example in [Ber95] (also reproduced in Exercise 6.9), where TD(0) produces a very bad solution relative to ΠJ_μ, which is the limit of the solution Φr̄_λ produced by TD(λ) as λ → 1. (This example involves a stochastic shortest path problem, but can be modified to illustrate the same conclusion for discounted problems.) Note, however, that in the context of approximate policy iteration, the correlation between approximation error in the cost of the current policy and the performance of the next policy is somewhat unclear in practice (for example adding a constant to the cost of the current policy at every state does not affect the result of the policy improvement step).
(g) As λ → 1, the size of the approximation error ‖J_μ − Φr̄_λ‖_ξ tends to diminish, but the methods become more vulnerable to simulation noise, and hence require more sampling for good performance. Indeed, the noise in a simulation sample of an ℓ-stages cost vector T_μ^ℓ J tends to be larger as ℓ increases, and from the formula

$$T^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^\infty \lambda^\ell\, T^{\ell+1}$$

it can be seen that the simulation samples of T^{(λ)}(Φr_k), used by LSTD(λ) and LSPE(λ), tend to contain more noise as λ increases. This is consistent with practical experience, which indicates that the algorithms tend to be faster and more reliable in practice when λ takes smaller values (or at least when λ is not too close to 1). Generally, there is no rule of thumb for selecting λ, which is usually chosen with some trial and error.
(h) TD() is much slower than LSTD() and LSPE() [unless the number
of basis functions s is extremely large, in which case the overhead
for the linear algebra calculations that are inherent in LSTD() and
LSPE() becomes excessive]. This can be traced to TD()s use of
single-sample approximations of C
()
and d
()
, which are much less
accurate than C
()
k
and d
()
k
.
The assumptions under which convergence of LSTD(), LSPE(),
and TD() is usually shown include:
(i) The use of a linear approximation architecture r, with satisfying
the rank Assumption 6.3.2.
(ii) The use for simulation purposes of a Markov chain that has a steady-
state distribution vector with positive components, which denes the
projection norm. This is typically the Markov chain associated with
the policy being evaluated, or an exploration-enhanced variant.
(iii) The use for simulation purposes of a Markov chain that denes a
projection norm with respect to which T
()
is a contraction. This is
important only for some of the methods: TD() and scaled LSPE()
[except for the regularized regression-based version (6.90)].
(iv) The use of a diminishing stepsize in the case of TD(). For LSTD(),
and LSPE() and its regularized regression-based form (6.90), there is
no stepsize choice, and in various cases of scaled versions of LSPE()
the required stepsize is constant.
(v) The use of a single policy, unchanged during the simulation; conver-
gence does not extend to the case where T involves a minimization
over multiple policies, or optimistic variants, where the policy used
to generate the simulation data is changed after a few transitions.
Let us now discuss the above assumptions (i)-(v). Regarding (i), there are no convergence guarantees for methods that use nonlinear architectures. In particular, an example in [TsV97] (also replicated in [BeT96], Example 6.6) shows that TD(λ) may diverge if a nonlinear architecture is used. In the case where Φ does not have rank s, the mapping ΠT^{(λ)} will still be a contraction with respect to ‖·‖_ξ, so it has a unique fixed point. In this case, TD(λ) has been shown to converge to some vector r̄ ∈ ℝ^s. This vector is the orthogonal projection of the initial guess r_0 on the set of solutions of the projected Bellman equation, i.e., the set of all r such that Φr is the unique fixed point of ΠT^{(λ)}; see [Ber09b], [Ber11a]. LSPE(λ) and its scaled variants can be shown to have a similar property.
Regarding (ii), if we use for simulation a Markov chain whose steady-
state distribution exists but has some components that are 0, the corre-
sponding states are transient, so they will not appear in the simulation
after a nite number of transitions. Once this happens, the algorithms
will operate as if the Markov chain consists of just the recurrent states,
and convergence will not be aected. However, the transient states would
be underrepresented in the cost approximation. A similar diculty occurs
if we use for simulation a Markov chain with multiple recurrent classes.
Then the results of the algorithms would depend on the initial state of
the simulated trajectory (more precisely on the recurrent class of this ini-
tial state). In particular, states from other recurrent classes, and transient
states would be underrepresented in the cost approximation obtained.
Regarding (iii), an example of divergence of TD(0) where the underlying projection norm is such that ΠT is not a contraction is given in [BeT96] (Example 6.7). Exercise 6.4 gives a similar example. On the other hand, as noted earlier, ΠT^{(λ)} is a contraction for any Euclidean projection norm, provided λ is sufficiently close to 1.
Regarding (iv), the method for stepsize choice is critical for TD(λ), both for convergence and for performance. This is a major drawback of TD(λ), which compounds its practical difficulty with slow convergence.

Regarding (v), once minimization over multiple policies is introduced [so T and T^{(λ)} are nonlinear], or optimistic variants are used, the behavior of the methods becomes quite peculiar and unpredictable because ΠT^{(λ)} may not be a contraction. For instance, there are examples where ΠT^{(λ)} has no fixed point, and examples where it has multiple fixed points; see [BeT96] (Example 6.9), and [DFV00]. Generally, the issues associated with policy oscillations, the chattering phenomenon, and the asymptotic behavior of nonoptimistic and optimistic approximate policy iteration, are not well understood. Figures 6.3.4 and 6.3.5 suggest their enormously complex nature: the points where subsets in the greedy partition join are potential points of attraction of the various algorithms.

(Footnote) Similar to Prop. 6.3.5, it can be shown that T^{(λ)} is a sup-norm contraction with modulus that tends to 0 as λ → 1. It follows that given any projection norm ‖·‖, ΠT^{(λ)} and ΠT_μ^{(λ)} are contractions with respect to ‖·‖, provided λ is sufficiently close to 1.
On the other hand, even in the case where T^{(λ)} is nonlinear, if ΠT^{(λ)} is a contraction, it has a unique fixed point, and the peculiarities associated with chattering do not arise. In this case the scaled PVI(λ) iteration [cf. Eq. (6.44)] takes the form

$$r_{k+1} = r_k - \gamma G\,\Phi'\Xi\bigl(\Phi r_k - T^{(\lambda)}(\Phi r_k)\bigr),$$

where G is a scaling matrix, and γ is a positive stepsize that is small enough to guarantee convergence. As discussed in [Ber09b], [Ber11a], this iteration converges to the unique fixed point of ΠT^{(λ)}, provided the constant stepsize γ is sufficiently small. Note that there are limited classes of problems, involving multiple policies, where the mapping ΠT^{(λ)} is a contraction. An example, optimal stopping problems, is discussed in Sections 6.5.3 and 6.8.3. Finally, let us note that the LSTD(λ) method relies on the linearity of the mapping T, and it has no practical generalization for the case where T is nonlinear.
6.4 AGGREGATION METHODS
In this section we revisit the aggregation methodology discussed in Section
6.3.4 of Vol. I, viewing it now in the context of policy evaluation with cost
function approximation for discounted DP. The aggregation approach re-
sembles in some ways the problem approximation approach discussed in
Section 6.3.3 of Vol. I: the original problem is approximated with a related
aggregate problem, which is then solved exactly to yield a cost-to-go
approximation for the original problem. Still, in other ways the aggrega-
tion approach resembles the projected equation/subspace approximation
approach, most importantly because it constructs cost approximations of
the form Φr, i.e., linear combinations of basis functions. However, there are important differences: in aggregation methods there are no projections with respect to Euclidean norms, the simulations can be done more flexibly, and from a mathematical point of view, the underlying contractions are with respect to the sup-norm rather than a Euclidean norm.

(Footnote) Aggregation may be used in conjunction with any Bellman equation associated with the given problem. For example, if the problem admits post-decision states (cf. Section 6.1.4), the aggregation may be done using the corresponding Bellman equation, with potentially significant simplifications resulting in the algorithms of this section.
To construct an aggregation framework, we introduce a finite set A of aggregate states, and we introduce two (somewhat arbitrary) choices of probabilities, which relate the original system states with the aggregate states:

(1) For each aggregate state x and original system state i, we specify the disaggregation probability d_{xi} [we have ∑_{i=1}^n d_{xi} = 1 for each x ∈ A]. Roughly, d_{xi} may be interpreted as the "degree to which x is represented by i."

(2) For each aggregate state y and original system state j, we specify the aggregation probability φ_{jy} (we have ∑_{y∈A} φ_{jy} = 1 for each j = 1, ..., n). Roughly, φ_{jy} may be interpreted as the "degree of membership of j in the aggregate state y." The vectors {φ_{jy} | j = 1, ..., n} may also be viewed as basis functions that will be used to represent approximations of the cost vectors of the original problem.
Let us mention a few examples:
(a) In hard and soft aggregation (Examples 6.3.9 and 6.3.10 of Vol. I), we group the original system states into subsets, and we view each subset as an aggregate state. In hard aggregation each state belongs to one and only one subset, and the aggregation probabilities are

φ_{jy} = 1 if system state j belongs to aggregate state/subset y.

One possibility is to choose the disaggregation probabilities as

d_{xi} = 1/n_x if system state i belongs to aggregate state/subset x,

where n_x is the number of states of x (this implicitly assumes that all states that belong to aggregate state/subset y are equally representative). In soft aggregation, we allow the aggregate states/subsets to overlap, with the aggregation probabilities φ_{jy} quantifying the degree of membership of j in the aggregate state/subset y. The selection of aggregate states in hard and soft aggregation is an important issue, which is not fully understood at present. However, in specific practical problems, based on intuition and problem-specific knowledge, there are usually evident choices, which may be fine-tuned by experimentation.

(b) In various discretization schemes, each original system state j is associated with a convex combination of aggregate states:

$$j \approx \sum_{y\in A} \phi_{jy}\, y,$$

for some nonnegative weights φ_{jy}, whose sum is 1, and which are viewed as aggregation probabilities (this makes geometrical sense if
both the original and the aggregate states are associated with points
in a Euclidean space, as described in Example 6.3.13 of Vol. I).
(c) In coarse grid schemes (cf. Example 6.3.12 of Vol. I and the subsequent example in Section 6.4.1), a subset of representative states is chosen, each being an aggregate state. Thus, each aggregate state x is associated with a unique original state i_x, and we may use the disaggregation probabilities d_{xi} = 1 for i = i_x and d_{xi} = 0 for i ≠ i_x. The aggregation probabilities are chosen as in the preceding case (b).
The aggregation approach approximates cost vectors with Φr, where r ∈ ℝ^s is a weight vector to be determined, and Φ is the matrix whose jth row consists of the aggregation probabilities φ_{j1}, ..., φ_{js}. Thus aggregation involves an approximation architecture similar to the one of projected equation methods: it uses as features the aggregation probabilities. Conversely, starting from a set of s features for each state, we may construct a feature-based hard aggregation scheme by grouping together states with "similar features." In particular, we may use a more or less regular partition of the feature space, which induces a possibly irregular partition of the original state space into aggregate states (all states whose features fall in the same set of the feature partition form an aggregate state). This is a general approach for passing from a feature-based approximation of the cost vector to an aggregation-based approximation (see also [BeT96], Section 3.1.2). Unfortunately, in the resulting aggregation scheme the number of aggregate states may become very large.
The aggregation and disaggregation probabilities specify a dynamical
system involving both aggregate and original system states (cf. Fig. 6.4.1).
In this system:
(i) From aggregate state x, we generate original system state i according to d_{xi}.

(ii) We generate transitions from original system state i to original system state j according to p_{ij}(u), with cost g(i, u, j).

(iii) From original system state j, we generate aggregate state y according to φ_{jy}.
One may associate various DP problem formulations with this system,
thereby obtaining two types of alternative cost approximations.
(a) In the rst approximation, discussed in Section 6.4.1, the focus is on
the aggregate states, the role of the original system states being to
dene the mechanisms of cost generation and probabilistic transition
from one aggregate state to the next. This approximation may lead to
small-sized aggregate problems that can be solved by ordinary value
and policy iteration methods, even if the number of original system
states is very large.
Figure 6.4.1 Illustration of the transition mechanism of a dynamical system involving both aggregate and original system states: disaggregation probabilities d_{xi} lead from aggregate states to original states, transitions occur according to p_{ij}(u) with cost g(i, u, j), and aggregation probabilities φ_{jy} lead back to aggregate states.
(b) In the second approximation, discussed in Section 6.4.2, the focus is
on both the original system states and the aggregate states, which
together are viewed as states of an enlarged system. Policy and value
iteration algorithms are then dened for this enlarged system. For a
large number of original system states, this approximation requires a
simulation-based implementation.
6.4.1 Cost Approximation via the Aggregate Problem

Here we formulate an aggregate problem where the control is applied with knowledge of the aggregate state (rather than the original system state). To this end, we assume that the control constraint set U(i) is independent of the state i, and we denote it by U. Then, the transition probability from aggregate state x to aggregate state y under control u, and the corresponding expected transition cost, are given by (cf. Fig. 6.4.1)

$$\hat p_{xy}(u) = \sum_{i=1}^n d_{xi}\sum_{j=1}^n p_{ij}(u)\,\phi_{jy},\qquad \hat g(x,u) = \sum_{i=1}^n d_{xi}\sum_{j=1}^n p_{ij}(u)\,g(i,u,j).\tag{6.157}$$

These transition probabilities and costs define an aggregate problem whose states are just the aggregate states.
The optimal cost function of the aggregate problem, denoted Ĵ, is obtained as the unique solution of Bellman's equation

$$\hat J(x) = \min_{u\in U}\Bigl[\hat g(x,u) + \alpha\sum_{y\in A}\hat p_{xy}(u)\,\hat J(y)\Bigr],\qquad \forall\ x.$$
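A minimal sketch of this construction: the aggregate quantities of Eq. (6.157) are formed from assumed model matrices and the aggregate Bellman equation is solved by plain value iteration (no simulation):

    import numpy as np

    def solve_aggregate_problem(p, g, D, Phi, alpha, n_iter=1000):
        """Form the aggregate problem of Eq. (6.157) and solve its Bellman
        equation by value iteration.  p[u][i, j] and g[u][i, j] are the
        original transition probabilities and costs under control u; D is
        the s x n disaggregation matrix, Phi the n x s aggregation matrix."""
        num_u = len(p)
        # Aggregate transition probabilities and expected costs, Eq. (6.157).
        p_hat = np.array([D @ p[u] @ Phi for u in range(num_u)])                 # (u, x, y)
        g_hat = np.array([D @ (p[u] * g[u]).sum(axis=1) for u in range(num_u)])  # (u, x)
        J_hat = np.zeros(D.shape[0])
        for _ in range(n_iter):
            J_hat = np.min(g_hat + alpha * p_hat @ J_hat, axis=0)  # aggregate Bellman eq.
        return J_hat, Phi @ J_hat    # aggregate costs and the approximation on original states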
This equation has dimension equal to the number of aggregate states, and can be solved by any of the available value and policy iteration methods, including ones that involve simulation. Once Ĵ is obtained, the optimal cost function J* of the original problem is approximated by J̃ given by

$$\tilde J(j) = \sum_{y\in A}\phi_{jy}\,\hat J(y),\qquad \forall\ j,$$

which is used for one-step lookahead in the original system; i.e., a suboptimal policy is obtained through the minimization

$$\tilde\mu(i) = \arg\min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha\tilde J(j)\bigr),\qquad i = 1,\ldots,n.$$

Note that for an original system state j, the approximation J̃(j) is a convex combination of the costs Ĵ(y) of the aggregate states y for which φ_{jy} > 0. In the case of hard aggregation, J̃ is piecewise constant: it assigns the same cost to all the states j that belong to the same aggregate state y (since φ_{jy} = 1 if j belongs to y and φ_{jy} = 0 otherwise).
The preceding scheme can also be applied to problems with infinite state space, and is well-suited for approximating the solution of partially observed Markov Decision problems (POMDP), which are defined over their belief space (space of probability distributions over their states, cf. Section 5.4.2 of Vol. I). By discretizing the belief space with a coarse grid, one obtains a finite-state (aggregate) DP problem of perfect state information that can be solved with the methods of Chapter 1 (see [ZhL97], [ZhH01], [YuB04]). The following example illustrates the main ideas and shows that in the POMDP case, where the optimal cost function is a concave function over the simplex of beliefs (see Vol. I, Section 5.4.2), the approximation obtained is a lower bound of the optimal cost function.
Example 6.4.1 (Coarse Grid/POMDP Discretization and Lower Bound Approximations)

Consider an α-discounted DP problem with bounded cost per stage (cf. Section 1.2), where the state space is a convex subset C of a Euclidean space. We use z to denote the elements of this space, to distinguish them from x which now denotes aggregate states. Bellman's equation is J = TJ with T defined by

$$(TJ)(z) = \min_{u\in U} E_w\Bigl\{g(z,u,w) + \alpha J\bigl(f(z,u,w)\bigr)\Bigr\},\qquad z\in C.$$

Let J* denote the optimal cost function. We select a finite subset/coarse grid of states {x_1, ..., x_m} ⊂ C, whose convex hull is C. Thus each state z ∈ C can be expressed as

$$z = \sum_{i=1}^m \phi_{z x_i}\, x_i,$$

where for each z, φ_{z x_i} ≥ 0, i = 1, ..., m, and ∑_{i=1}^m φ_{z x_i} = 1. We view {x_1, ..., x_m} as aggregate states with aggregation probabilities φ_{z x_i}, i = 1, ..., m, for each z ∈ C. The disaggregation probabilities are d_{x_k i} = 1 for i = x_k and d_{x_k i} = 0 for i ≠ x_k, k = 1, ..., m. Consider the mapping T̃ defined by

$$(\tilde T J)(z) = \min_{u\in U} E_w\Bigl\{g(z,u,w) + \alpha\sum_{j=1}^m \phi_{f(z,u,w)\, x_j}\,J(x_j)\Bigr\},\qquad z\in C,$$

where φ_{f(z,u,w) x_j} are the aggregation probabilities of the next state f(z, u, w). We note that T̃ is a contraction mapping with respect to the sup-norm. Let Ĵ denote its unique fixed point, so that we have

$$\hat J(x_i) = (\tilde T\hat J)(x_i),\qquad i = 1,\ldots,m.$$

This is Bellman's equation for an aggregated finite-state discounted DP problem whose states are x_1, ..., x_m, and can be solved by standard value and policy iteration methods that need not use simulation. We approximate the optimal cost function of the original problem by

$$\tilde J(z) = \sum_{i=1}^m \phi_{z x_i}\,\hat J(x_i),\qquad z\in C.$$

Suppose now that J* is a concave function over C (as in the POMDP case, where J* is the limit of the finite horizon optimal cost functions that are concave, as shown in Vol. I, Section 5.4.2). Then for all (z, u, w), since φ_{f(z,u,w) x_j}, j = 1, ..., m, are probabilities that add to 1, we have

$$J^*\bigl(f(z,u,w)\bigr) = J^*\Bigl(\sum_{i=1}^m \phi_{f(z,u,w)\, x_i}\, x_i\Bigr) \ge \sum_{i=1}^m \phi_{f(z,u,w)\, x_i}\,J^*(x_i);$$

this is a consequence of the definition of concavity and is also known as Jensen's inequality (see e.g., [Ber09a]). It then follows from the definitions of T and T̃ that

$$J^*(z) = (TJ^*)(z) \ge (\tilde T J^*)(z),\qquad z\in C,$$

so by iterating, we see that

$$J^*(z) \ge \lim_{k\to\infty}(\tilde T^k J^*)(z) = \hat J(z),\qquad z\in C,$$

where the last equation follows because T̃ is a contraction, and hence T̃^k J* must converge to the unique fixed point Ĵ of T̃. For z = x_i, we have in particular

$$J^*(x_i) \ge \hat J(x_i),\qquad i = 1,\ldots,m,$$

from which we obtain for all z ∈ C,

$$J^*(z) = J^*\Bigl(\sum_{i=1}^m \phi_{z x_i}\, x_i\Bigr) \ge \sum_{i=1}^m \phi_{z x_i}\,J^*(x_i) \ge \sum_{i=1}^m \phi_{z x_i}\,\hat J(x_i) = \tilde J(z),$$

where the first inequality follows from the concavity of J*. Thus the approximation J̃(z) obtained from the aggregate system provides a lower bound to J*(z). Similarly, if J* can be shown to be convex, the preceding argument can be modified to show that J̃(z) is an upper bound to J*(z).
6.4.2 Cost Approximation via the Enlarged Problem

The approach of the preceding subsection calculates cost approximations assuming that policies assign controls to aggregate states, rather than to states of the original system. Thus, for example, in the case of hard aggregation, the calculations assume that the same control will be applied to every original system state within a given aggregate state. We will now discuss an alternative approach that is not subject to this limitation. Let us consider the system consisting of the original states and the aggregate states, with the transition probabilities and the stage costs described earlier (cf. Fig. 6.4.1). We introduce the vectors J̃_0, J̃_1, and R* where:

R*(x) is the optimal cost-to-go from aggregate state x.

J̃_0(i) is the optimal cost-to-go from original system state i that has just been generated from an aggregate state (left side of Fig. 6.4.1).

J̃_1(j) is the optimal cost-to-go from original system state j that has just been generated from an original system state (right side of Fig. 6.4.1).

Note that because of the intermediate transitions to aggregate states, J̃_0 and J̃_1 are different.

These three vectors satisfy the following three Bellman's equations:

$$R^*(x) = \sum_{i=1}^n d_{xi}\,\tilde J_0(i),\qquad x\in A,\tag{6.158}$$

$$\tilde J_0(i) = \min_{u\in U(i)}\sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha\tilde J_1(j)\bigr),\qquad i = 1,\ldots,n,\tag{6.159}$$

$$\tilde J_1(j) = \sum_{y\in A}\phi_{jy}\,R^*(y),\qquad j = 1,\ldots,n.\tag{6.160}$$

By combining these equations, we obtain an equation for R*:

$$R^*(x) = (FR^*)(x),\qquad x\in A,$$
where F is the mapping dened by
(FR)(x) =
n

i=1
d
xi
min
uU(i)
n

j=1
p
ij
(u)
_
_
g(i, u, j) +

yA

jy
R(y)
_
_
, x /.
(6.161)
It can be seen that F is a sup-norm contraction mapping and has R

as
its unique xed point. This follows from standard contraction arguments
(cf. Prop. 1.2.4) and the fact that d
xi
, p
ij
(u), and
jy
are all transition
probabilities.
Once $R^*$ is found, the optimal cost-to-go of the original problem may be approximated by $\tilde J_1 = \Phi R^*$, and a suboptimal policy may be found through the minimization (6.159) that defines $\tilde J_0$. Again, the optimal cost function approximation $\tilde J_1$ is a linear combination of the columns of $\Phi$, which may be viewed as basis functions.
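As an illustration of the mapping (6.161) and of the value iteration $R \leftarrow FR$ discussed next, the following Python sketch may be helpful. It is a minimal sketch of my own, not code from the text: the arrays p[i, u, j] (transition probabilities), g[i, u, j] (stage costs), D (the $m \times n$ matrix of disaggregation probabilities $d_{xi}$), and Phi (the $n \times m$ matrix of aggregation probabilities $\phi_{jy}$) are assumed, hypothetical inputs.

import numpy as np

def F(R, p, g, D, Phi, alpha):
    # continuation value of each original state j under aggregate costs R
    cont = Phi @ R                                        # shape (n,)
    # Q(i, u) = sum_j p_ij(u) * (g(i, u, j) + alpha * cont[j])
    Q = np.einsum('iuj,iuj->iu', p, g) + alpha * (p @ cont)
    J0 = Q.min(axis=1)                                    # minimize over u in U(i)
    return D @ J0                                         # weight by d_xi, cf. (6.161)

def value_iterate(p, g, D, Phi, alpha, iters=1000, tol=1e-10):
    R = np.zeros(D.shape[0])
    for _ in range(iters):
        R_new = F(R, p, g, D, Phi, alpha)
        if np.max(np.abs(R_new - R)) < tol:
            break
        R = R_new
    return R

Policy iteration on the aggregate problem can be implemented similarly, with the minimization over $u$ frozen at $\mu(i)$ in the evaluation step [cf. Eq. (6.162) below].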
Value and Policy Iteration
One may use value and policy iteration-type algorithms to find $R^*$. The value iteration algorithm simply generates successively $FR, F^2R, \ldots$, starting with some initial guess $R$. The policy iteration algorithm starts with a stationary policy $\mu^0$ for the original problem, and given $\mu^k$, it finds $R^{\mu^k}$ satisfying $R^{\mu^k} = F_{\mu^k} R^{\mu^k}$, where $F_\mu$ is the mapping defined by
\[
(F_\mu R)(x) = \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\Bigl(g\bigl(i,\mu(i),j\bigr) + \alpha \sum_{y \in \mathcal{A}} \phi_{jy} R(y)\Bigr), \qquad x \in \mathcal{A}, \qquad (6.162)
\]
(this is the policy evaluation step). It then generates $\mu^{k+1}$ by
\[
\mu^{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\Bigl(g(i,u,j) + \alpha \sum_{y \in \mathcal{A}} \phi_{jy} R^{\mu^k}(y)\Bigr), \qquad \forall\ i, \qquad (6.163)
\]
$\dagger$ A quick proof is to observe that $F$ is the composition $F = DT\Phi$, where $T$ is the usual DP mapping, and $D$ and $\Phi$ are the matrices with rows the disaggregation and aggregation distributions, respectively. Since $T$ is a contraction with respect to the sup-norm, and $D$ and $\Phi$ are sup-norm nonexpansive in the sense
\[
\|Dx\|_\infty \le \|x\|_\infty, \quad \forall\ x \in \Re^n, \qquad \|\Phi y\|_\infty \le \|y\|_\infty, \quad \forall\ y \in \Re^s,
\]
it follows that $F$ is a sup-norm contraction.
(this is the policy improvement step). Based on the discussion in Section 6.3.8 and Prop. 6.3.8, this policy iteration algorithm converges to the unique fixed point of $F$ in a finite number of iterations. The key fact here is that $F$ and $F_\mu$ are not only sup-norm contractions, but also have the monotonicity property of DP mappings (cf. Section 1.1.2 and Lemma 1.1.1), which was used in an essential way in the convergence proof of ordinary policy iteration (cf. Prop. 1.3.4).
As discussed in Section 6.3.8, when the policy sequence $\{\mu^k\}$ converges to some $\overline\mu$, as it does here, we have the error bound
\[
\|J_{\overline\mu} - J^*\|_\infty \le \frac{2\alpha\delta}{1-\alpha}, \qquad (6.164)
\]
where $\delta$ satisfies
\[
\|J_k - J_{\mu^k}\|_\infty \le \delta,
\]
for all generated policies $\mu^k$, and $J_k$ is the approximate cost vector of $\mu^k$ that is used for policy improvement (which is $\Phi R^{\mu^k}$ in the case of aggregation).

This is much sharper than the error bound
\[
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\|_\infty \le \frac{2\alpha\delta}{(1-\alpha)^2},
\]
of Prop. 1.3.6.

The preceding error bound improvement suggests that approximate policy iteration based on aggregation may hold some advantage in terms of approximation quality, relative to its projected equation-based counterpart. For a generalization of this idea, see Exercise 6.15. The price for this, however, is that the basis functions in the aggregation approach are restricted by the requirement that the rows of $\Phi$ must be probability distributions.
Simulation-Based Policy Iteration
The policy iteration method just described requires $n$-dimensional calculations, and is impractical when $n$ is large. An alternative, which is consistent with the philosophy of this chapter, is to implement it by simulation, using a matrix inversion/LSTD-type method, as we now proceed to describe.

For a given policy $\mu$, the aggregate version of Bellman's equation, $R = F_\mu R$, is linear of the form [cf. Eq. (6.162)]
\[
R = D T_\mu(\Phi R),
\]
where $D$ and $\Phi$ are the matrices with rows the disaggregation and aggregation distributions, respectively, and $T_\mu$ is the DP mapping associated with $\mu$, i.e.,
\[
T_\mu J = g_\mu + \alpha P_\mu J,
\]
with $P_\mu$ the transition probability matrix corresponding to $\mu$, and $g_\mu$ the vector whose $i$th component is
\[
\sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\, g\bigl(i,\mu(i),j\bigr).
\]
We can thus write this equation as
\[
ER = f,
\]
where
\[
E = I - \alpha D P_\mu \Phi, \qquad f = D g_\mu, \qquad (6.165)
\]
in analogy with the corresponding matrix and vector for the projected equation [cf. Eq. (6.41)].

We may use low-dimensional simulation to approximate $E$ and $f$ based on a given number of samples, similar to Section 6.3.3 [cf. Eqs. (6.48) and (6.49)]. In particular, a sample sequence $\bigl\{(i_0, j_0), (i_1, j_1), \ldots\bigr\}$ is obtained by first generating a sequence of states $\{i_0, i_1, \ldots\}$ by sampling according to a distribution $\{\xi_i \mid i = 1,\ldots,n\}$ (with $\xi_i > 0$ for all $i$), and then by generating for each $t$ the column index $j_t$ using sampling according to the distribution $\{p_{i_t j} \mid j = 1,\ldots,n\}$. Given the first $k+1$ samples, we form the matrix $\hat E_k$ and vector $\hat f_k$ given by
\[
\hat E_k = I - \frac{\alpha}{k+1} \sum_{t=0}^k \frac{1}{\xi_{i_t}}\, d(i_t)\, \phi(j_t)', \qquad
\hat f_k = \frac{1}{k+1} \sum_{t=0}^k \frac{1}{\xi_{i_t}}\, d(i_t)\, g\bigl(i_t, \mu(i_t), j_t\bigr), \qquad (6.166)
\]
where $d(i)$ is the $i$th column of $D$ and $\phi(j)'$ is the $j$th row of $\Phi$. The convergence $\hat E_k \to E$ and $\hat f_k \to f$ follows from the expressions
\[
E = I - \alpha \sum_{i=1}^n \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\, d(i)\, \phi(j)', \qquad
f = \sum_{i=1}^n \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\, d(i)\, g\bigl(i,\mu(i),j\bigr),
\]
the relation
\[
\lim_{k\to\infty} \frac{\sum_{t=0}^k \delta(i_t = i,\ j_t = j)}{k+1} = \xi_i\, p_{ij},
\]
and law of large numbers arguments (cf. Section 6.3.3).
It is important to note that the sampling probabilities $\xi_i$ are restricted to be positive, but are otherwise arbitrary and need not depend on the current policy. Moreover, their choice does not affect the obtained approximate solution of the equation $ER = f$. Because of this possibility, the problem of exploration is less acute in the context of policy iteration when aggregation is used for policy evaluation. This is in contrast with the projected equation approach, where the choice of $\xi_i$ affects the projection norm and the solution of the projected equation, as well as the contraction properties of the mapping $\Pi T$.
Note also that instead of using the probabilities $\xi_i$ to sample original system states, we may alternatively sample the aggregate states $x$ according to a distribution $\{\zeta_x \mid x \in \mathcal{A}\}$, generate a sequence of aggregate states $\{x_0, x_1, \ldots\}$, and then generate a state sequence $\{i_0, i_1, \ldots\}$ using the disaggregation probabilities. In this case the equations (6.166) should be modified as follows:
\[
\hat E_k = I - \frac{\alpha}{k+1} \sum_{t=0}^k \frac{1}{\zeta_{x_t}\, d_{x_t i_t}}\, d(i_t)\, \phi(j_t)', \qquad
\hat f_k = \frac{1}{k+1} \sum_{t=0}^k \frac{1}{\zeta_{x_t}\, d_{x_t i_t}}\, d(i_t)\, g\bigl(i_t, \mu(i_t), j_t\bigr).
\]
The corresponding matrix inversion/LSTD-type method generates $\hat R_k = \hat E_k^{-1} \hat f_k$, and approximates the cost vector of $\mu$ by the vector $\Phi \hat R_k$:
\[
\tilde J_\mu = \Phi \hat R_k.
\]
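The estimates (6.166) and the matrix inversion/LSTD-type solve may be sketched in Python as follows. This is a minimal illustration under assumptions of my own, not the author's code: samples is a list of triples $(i_t, j_t, g(i_t,\mu(i_t),j_t))$ generated as described above, xi is the sampling distribution over original states, and D, Phi are the disaggregation and aggregation matrices.

import numpy as np

def lstd_aggregation(samples, D, Phi, xi, alpha):
    """samples: list of (i, j, cost), with i drawn i.i.d. from xi and
    j drawn from p_{i.}(mu(i)); returns R_hat solving E_hat R = f_hat."""
    m = D.shape[0]
    E_hat = np.eye(m)
    f_hat = np.zeros(m)
    k1 = len(samples)
    for (i, j, cost) in samples:
        w = 1.0 / xi[i]                       # importance weight 1/xi_i
        E_hat -= (alpha / k1) * w * np.outer(D[:, i], Phi[j, :])
        f_hat += (1.0 / k1) * w * cost * D[:, i]
    return np.linalg.solve(E_hat, f_hat)      # R_hat_k = E_hat_k^{-1} f_hat_k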
There is also a regression-based version that is suitable for the case where $\hat E_k$ is nearly singular (cf. Section 6.3.4), as well as an iterative regression-based version of LSTD, which may be viewed as a special case of (scaled) LSPE. The latter method takes the form
\[
\hat R_{k+1} = \bigl(\hat E_k' \Sigma_k^{-1} \hat E_k + \beta I\bigr)^{-1} \bigl(\hat E_k' \Sigma_k^{-1} \hat f_k + \beta \hat R_k\bigr), \qquad (6.167)
\]
where $\beta > 0$ and $\Sigma_k$ is a positive definite symmetric matrix [cf. Eq. (6.75)]. Note that contrary to the projected equation case, for a discount factor $\alpha \approx 1$, $\hat E_k$ will always be nearly singular [since $D P_\mu \Phi$ is a transition probability matrix, cf. Eq. (6.165)].
The nonoptimistic version of this aggregation-based policy iteration method does not exhibit the oscillatory behavior of the one based on the projected equation approach (cf. Section 6.3.8): the generated policies converge and the limit policy satisfies the sharper error bound (6.164), as noted earlier. Moreover, optimistic versions of the method also do not exhibit the chattering phenomenon described in Section 6.3.8. This is similar to optimistic policy iteration for the case of a lookup table representation of the cost of the current policy: we are essentially dealing with a lookup table representation of the cost of the aggregate system of Fig. 6.4.1.

The preceding arguments indicate that aggregation-based policy iteration holds an advantage over its projected equation-based counterpart in terms of regularity of behavior, error guarantees, and exploration-related difficulties. Its limitation is that the basis functions in the aggregation approach are restricted by the requirement that the rows of $\Phi$ must be probability distributions. For example, in the case of a single basis function ($s = 1$), there is only one possible choice for $\Phi$ in the aggregation context, namely the matrix whose single column is the unit vector.
Simulation-Based Value Iteration
The value iteration algorithm also admits a simulation-based implementation. It generates a sequence of aggregate states $\{x_0, x_1, \ldots\}$ by some probabilistic mechanism, which ensures that all aggregate states are generated infinitely often. Given each $x_k$, it independently generates an original system state $i_k$ according to the probabilities $d_{x_k i}$, and updates $R(x_k)$ according to
\[
R_{k+1}(x_k) = (1-\gamma_k) R_k(x_k) + \gamma_k \min_{u \in U(i_k)} \sum_{j=1}^n p_{i_k j}(u)\Bigl(g(i_k,u,j) + \alpha \sum_{y \in \mathcal{A}} \phi_{jy} R_k(y)\Bigr), \qquad (6.168)
\]
where $\gamma_k$ is a diminishing positive stepsize, and leaves all the other components of $R$ unchanged:
\[
R_{k+1}(x) = R_k(x), \qquad \text{if } x \ne x_k.
\]
This algorithm can be viewed as an asynchronous stochastic approximation version of value iteration. Its convergence mechanism and justification are very similar to the ones to be given for Q-learning in Section 6.5.1. It is often recommended to use a stepsize $\gamma_k$ that depends on the state $x_k$ being iterated on, such as for example $\gamma_k = 1/\bigl(1 + n(x_k)\bigr)$, where $n(x_k)$ is the number of times the state $x_k$ has been generated in the simulation up to time $k$.
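A minimal sketch (my own, with the same assumed model arrays as in the earlier sketches) of the stochastic iteration (6.168), using the state-dependent stepsize $\gamma_k = 1/\bigl(1+n(x_k)\bigr)$ recommended above; the uniform sampling of aggregate states is just one choice of the required probabilistic mechanism.

import numpy as np

def simulation_based_vi(p, g, D, Phi, alpha, num_iters, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    R = np.zeros(m)
    counts = np.zeros(m, dtype=int)
    for _ in range(num_iters):
        x = rng.integers(m)                       # pick an aggregate state
        i = rng.choice(n, p=D[x])                 # sample i according to d_{x .}
        cont = Phi @ R                            # sum_y phi_{jy} R_k(y), for all j
        q = (p[i] * (g[i] + alpha * cont)).sum(axis=1)   # one value per u in U(i)
        counts[x] += 1
        gamma = 1.0 / (1 + counts[x])             # stepsize 1/(1 + n(x_k))
        R[x] = (1 - gamma) * R[x] + gamma * q.min()
    return R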
Multistep Aggregation
The aggregation methodology of this section can be generalized by considering a multistep aggregation-based dynamical system. This system, illustrated in Fig. 6.4.2, is specified by disaggregation and aggregation probabilities as before, but involves $k > 1$ transitions between original system states in between transitions from and to aggregate states.

We introduce vectors $\tilde J_0, \tilde J_1, \ldots, \tilde J_k$, and $R^*$ where:

$R^*(x)$ is the optimal cost-to-go from aggregate state $x$.

$\tilde J_0(i)$ is the optimal cost-to-go from original system state $i$ that has just been generated from an aggregate state (left side of Fig. 6.4.2).

$\tilde J_1(j_1)$ is the optimal cost-to-go from original system state $j_1$ that has just been generated from an original system state $i$.
Figure 6.4.2 The transition mechanism for multistep aggregation. It is based on a dynamical system involving aggregate states, and $k$ transitions between original system states in between transitions from and to aggregate states.
$\tilde J_m(j_m)$, $m = 2, \ldots, k$, is the optimal cost-to-go from original system state $j_m$ that has just been generated from an original system state $j_{m-1}$.
These vectors satisfy the following Bellman's equations:
\[
R^*(x) = \sum_{i=1}^n d_{xi}\, \tilde J_0(i), \qquad x \in \mathcal{A},
\]
\[
\tilde J_0(i) = \min_{u \in U(i)} \sum_{j_1=1}^n p_{i j_1}(u)\bigl(g(i,u,j_1) + \alpha \tilde J_1(j_1)\bigr), \qquad i = 1,\ldots,n, \qquad (6.169)
\]
\[
\tilde J_m(j_m) = \min_{u \in U(j_m)} \sum_{j_{m+1}=1}^n p_{j_m j_{m+1}}(u)\bigl(g(j_m,u,j_{m+1}) + \alpha \tilde J_{m+1}(j_{m+1})\bigr), \qquad j_m = 1,\ldots,n, \quad m = 1,\ldots,k-1, \qquad (6.170)
\]
\[
\tilde J_k(j_k) = \sum_{y \in \mathcal{A}} \phi_{j_k y}\, R^*(y), \qquad j_k = 1,\ldots,n. \qquad (6.171)
\]
By combining these equations, we obtain an equation for $R^*$:
\[
R^*(x) = (FR^*)(x), \qquad x \in \mathcal{A},
\]
where $F$ is the mapping defined by
\[
FR = D T^k(\Phi R),
\]
where $T$ is the usual DP mapping of the problem. As earlier, it can be seen that $F$ is a sup-norm contraction, but its contraction modulus is $\alpha^k$ rather than $\alpha$.
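For concreteness, the multistep mapping $FR = DT^k(\Phi R)$ may be sketched as follows (again with the hypothetical model arrays of the earlier sketches); iterating it converges to $R^*$ at the faster geometric rate $\alpha^k$.

import numpy as np

def T(J, p, g, alpha):
    # usual DP mapping: (TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j))
    return (p * (g + alpha * J)).sum(axis=2).min(axis=1)

def F_multistep(R, p, g, D, Phi, alpha, k):
    J = Phi @ R                    # extend aggregate costs to original states
    for _ in range(k):             # apply T k times
        J = T(J, p, g, alpha)
    return D @ J                   # average back with the disaggregation probabilities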
There is a similar mapping corresponding to a fixed policy $\mu$, and it can be used to implement a policy iteration algorithm, which evaluates a policy through calculation of a corresponding vector $R$ and then improves it. However, there is a major difference from the single-step aggregation case: a policy involves a set of $k$ control functions $\{\mu_0, \ldots, \mu_{k-1}\}$, and while a known policy can be easily simulated, its improvement involves multistep lookahead using the minimizations of Eqs. (6.169)-(6.171), and may be costly. Thus multistep aggregation is a useful idea only for problems where the cost of this multistep lookahead minimization (for a single given starting state) is not prohibitive. On the other hand, note that from the theoretical point of view, a multistep scheme provides a means of better approximation of the true optimal cost vector $J^*$, independent of the use of a large number of aggregate states. This can be seen from Eqs. (6.169)-(6.171), which by classical value iteration convergence results, show that $\tilde J_0(i) \to J^*(i)$ as $k \to \infty$, regardless of the choice of aggregate states.


Asynchronous Distributed Aggregation
Let us now discuss the distributed solution of large-scale discounted DP problems using cost function approximation based on hard aggregation. We partition the original system states into aggregate states/subsets $x \in \mathcal{A} = \{x_1, \ldots, x_m\}$, and we envision a network of processors, each updating asynchronously a detailed/exact local cost function, defined on a single aggregate state/subset. Each processor also maintains an aggregate cost for its aggregate state, which is the weighted average detailed cost of the (original system) states in the processor's subset, weighted by the corresponding disaggregation probabilities. These aggregate costs are communicated between processors and are used to perform the local updates.

In a synchronous value iteration method of this type, each processor $a = 1, \ldots, m$, maintains/updates a (local) cost $J(i)$ for every original system state $i \in x_a$, and an aggregate cost
\[
R(a) = \sum_{i \in x_a} d_{x_a i}\, J(i),
\]
with $d_{x_a i}$ being the corresponding disaggregation probabilities. We generically denote by $J$ and $R$ the vectors with components $J(i)$, $i = 1,\ldots,n$, and $R(a)$, $a = 1,\ldots,m$, respectively. These components are updated according to
\[
J_{k+1}(i) = \min_{u \in U(i)} H_a(i, u, J_k, R_k), \qquad i \in x_a, \qquad (6.172)
\]
with
\[
R_k(a) = \sum_{i \in x_a} d_{x_a i}\, J_k(i), \qquad a = 1,\ldots,m, \qquad (6.173)
\]
where the mapping $H_a$ is defined for all $a = 1,\ldots,m$, $i \in x_a$, $u \in U(i)$, and $J \in \Re^n$, $R \in \Re^m$, by
\[
H_a(i, u, J, R) = \sum_{j=1}^n p_{ij}(u)\, g(i,u,j) + \alpha \sum_{j \in x_a} p_{ij}(u)\, J(j) + \alpha \sum_{j \notin x_a} p_{ij}(u)\, R\bigl(x(j)\bigr), \qquad (6.174)
\]
and where for each original system state $j$, we denote by $x(j)$ the subset to which $j$ belongs [i.e., $j \in x(j)$]. Thus the iteration (6.172) is the same as ordinary value iteration, except that the aggregate costs $R\bigl(x(j)\bigr)$ are used for states $j$ whose costs are updated by other processors.

It is possible to show that the iteration (6.172)-(6.173) involves a sup-norm contraction mapping of modulus $\alpha$, so it converges to the unique solution of the system of equations in $(J, R)$
\[
J(i) = \min_{u \in U(i)} H_a(i, u, J, R), \qquad R(a) = \sum_{i \in x_a} d_{x_a i}\, J(i), \qquad i \in x_a,\ a = 1,\ldots,m; \qquad (6.175)
\]
this follows from the fact that $\{d_{x_a i} \mid i = 1,\ldots,n\}$ is a probability distribution. We may view the equations (6.175) as a set of Bellman equations for an "aggregate DP problem," which, similar to our earlier discussion, involves both the original and the aggregate system states. The difference from the Bellman equations (6.158)-(6.160) is that the mapping (6.174) involves $J(j)$ rather than $R\bigl(x(j)\bigr)$ for $j \in x_a$.
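The following sketch, an illustration of my own rather than the author's code, performs one sweep of the synchronous iteration (6.172)-(6.173): it forms the aggregate costs of Eq. (6.173) and minimizes the mapping $H_a$ of Eq. (6.174) at every state. The array subset_of, giving the subset index $x(i)$ of each original state, is a hypothetical input; in a genuinely distributed implementation each processor would execute the loop only over the states of its own subset.

import numpy as np

def sync_sweep(J, p, g, D, subset_of, alpha):
    """J: current local costs (length n); subset_of: NumPy integer array with
    subset_of[i] = index of the aggregate subset containing state i."""
    n = J.shape[0]
    R = D @ J                                        # aggregate costs, Eq. (6.173)
    J_new = np.empty(n)
    for i in range(n):
        a = subset_of[i]
        same = (subset_of == a)
        # J(j) for j in the processor's own subset, R(x(j)) otherwise, cf. (6.174)
        cont = np.where(same, J, R[subset_of])
        q = (p[i] * (g[i] + alpha * cont)).sum(axis=1)   # H_a(i, u, J, R) for each u
        J_new[i] = q.min()                               # Eq. (6.172)
    return J_new, R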
In the algorithm (6.172)-(6.173), all processors $a$ must be updating their local costs $J(i)$ and aggregate costs $R(a)$ synchronously, and communicate the aggregate costs to the other processors before a new iteration may begin. This is often impractical and time-wasting. In a more practical asynchronous version of the method, the aggregate costs $R(a)$ may be outdated to account for communication "delays" between processors. Moreover, the costs $J(i)$ need not be updated for all $i$; it is sufficient that they are updated by each processor $a$ only for a (possibly empty) subset $I_{a,k} \subset x_a$. In this case, the iteration (6.172)-(6.173) is modified to take the form
\[
J_{k+1}(i) = \min_{u \in U(i)} H_a\bigl(i, u, J_k, R_{\tau_{1,k}}(1), \ldots, R_{\tau_{m,k}}(m)\bigr), \qquad i \in I_{a,k}, \qquad (6.176)
\]
with $0 \le \tau_{a,k} \le k$ for $a = 1,\ldots,m$, and
\[
R_\tau(a) = \sum_{i \in x_a} d_{x_a i}\, J_\tau(i), \qquad a = 1,\ldots,m. \qquad (6.177)
\]
The differences $k - \tau_{a,k}$, $a = 1,\ldots,m$, in Eq. (6.176) may be viewed as "delays" between the current time $k$ and the times $\tau_{a,k}$ when the corresponding aggregate costs were computed at other processors. For convergence, it is of course essential that every $i \in x_a$ belongs to $I_{a,k}$ for infinitely many $k$ (so each cost component is updated infinitely often), and $\lim_{k\to\infty} \tau_{a,k} = \infty$ for all $a = 1,\ldots,m$ (so that processors eventually communicate more recently computed aggregate costs to other processors).

Asynchronous distributed DP methods of this type have been proposed and analyzed by the author in [Ber82]. Their convergence, based on the sup-norm contraction property of the mapping underlying Eq. (6.175), has been established in [Ber82] (see also [Ber83]). The monotonicity property is also sufficient to establish convergence, and this may be useful in the convergence analysis of related algorithms for other nondiscounted DP models. We also mention that asynchronous distributed policy iteration methods have been developed recently (see [BeY10b]).
6.5 Q-LEARNING
We now introduce another method for discounted problems, which is suitable for cases where there is no explicit model of the system and the cost structure (a model-free context). The method is related to value iteration and can be used directly in the case of multiple policies. Instead of approximating the cost function of a particular policy, it updates the Q-factors associated with an optimal policy, thereby avoiding the multiple policy evaluation steps of the policy iteration method.

In the discounted problem, the Q-factors are defined, for all pairs $(i, u)$ with $u \in U(i)$, by
\[
Q^*(i, u) = \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha J^*(j)\bigr).
\]
Using Bellman's equation, we see that the Q-factors satisfy for all pairs $(i, u)$,
\[
Q^*(i, u) = \sum_{j=1}^n p_{ij}(u)\Bigl(g(i,u,j) + \alpha \min_{v \in U(j)} Q^*(j, v)\Bigr), \qquad (6.178)
\]
and can be shown to be the unique solution of this set of equations. The proof is essentially the same as the proof of existence and uniqueness of the solution of Bellman's equation. In fact, by introducing a system whose states are the original states $1,\ldots,n$, together with all the pairs $(i, u)$, the above set of equations can be seen to be a special case of Bellman's equation (see Fig. 6.5.1). The Q-factors can be obtained by the value iteration $Q_{k+1} = FQ_k$, where $F$ is the mapping defined by
\[
(FQ)(i, u) = \sum_{j=1}^n p_{ij}(u)\Bigl(g(i,u,j) + \alpha \min_{v \in U(j)} Q(j, v)\Bigr), \qquad \forall\ (i, u). \qquad (6.179)
\]
Figure 6.5.1 Modified problem where the state-control pairs $(i, u)$ are viewed as additional states. The bottom figure corresponds to a fixed policy $\mu$. The transitions from $(i, u)$ to $j$ are according to transition probabilities $p_{ij}(u)$ and incur a cost $g(i, u, j)$. Once the control $v$ is chosen, the transitions from $j$ to $(j, v)$ occur with probability 1 and incur no cost.
Since $F$ is a sup-norm contraction with modulus $\alpha$ (it corresponds to Bellman's equation for an $\alpha$-discounted problem), this iteration converges from every starting point $Q_0$.

The Q-learning algorithm is an approximate version of value iteration, whereby the expected value in Eq. (6.179) is suitably approximated by sampling and simulation. In particular, an infinitely long sequence of state-control pairs $\{(i_k, u_k)\}$ is generated according to some probabilistic mechanism. Given the pair $(i_k, u_k)$, a state $j_k$ is generated according to the probabilities $p_{i_k j}(u_k)$. Then the Q-factor of $(i_k, u_k)$ is updated using a stepsize $\gamma_k > 0$ while all other Q-factors are left unchanged:
\[
Q_{k+1}(i, u) = \begin{cases} (1-\gamma_k)\, Q_k(i, u) + \gamma_k (F_k Q_k)(i, u) & \text{if } (i, u) = (i_k, u_k), \\ Q_k(i, u) & \text{if } (i, u) \ne (i_k, u_k), \end{cases}
\]
where
\[
(F_k Q_k)(i_k, u_k) = g(i_k, u_k, j_k) + \alpha \min_{v \in U(j_k)} Q_k(j_k, v).
\]
Equivalently,
\[
Q_{k+1}(i, u) = (1-\gamma_k)\, Q_k(i, u) + \gamma_k (F_k Q_k)(i, u), \qquad \forall\ (i, u), \qquad (6.180)
\]
where
\[
(F_k Q_k)(i, u) = \begin{cases} g(i_k, u_k, j_k) + \alpha \min_{v \in U(j_k)} Q_k(j_k, v) & \text{if } (i, u) = (i_k, u_k), \\ Q_k(i, u) & \text{if } (i, u) \ne (i_k, u_k). \end{cases} \qquad (6.181)
\]
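A minimal sketch of the Q-learning iteration (6.180)-(6.181). The functions sample_pair and sample_next are hypothetical stand-ins for the probabilistic mechanism and the transition simulator of the text; the stepsize $1/n_k$ anticipates the discussion of Section 6.5.1.

import numpy as np

def q_learning_step(Q, i, u, j, cost, alpha, gamma):
    # (F_k Q_k)(i_k, u_k) = g(i_k, u_k, j_k) + alpha * min_v Q_k(j_k, v)
    target = cost + alpha * Q[j].min()
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * target     # Eq. (6.180)
    return Q

def q_learning(sample_pair, sample_next, n, num_controls, alpha, iters):
    Q = np.zeros((n, num_controls))
    counts = np.zeros((n, num_controls), dtype=int)
    for _ in range(iters):
        i, u = sample_pair()                 # generate (i_k, u_k)
        j, cost = sample_next(i, u)          # generate j_k and g(i_k, u_k, j_k)
        counts[i, u] += 1
        gamma = 1.0 / counts[i, u]           # diminishing stepsize, cf. Section 6.5.1
        Q = q_learning_step(Q, i, u, j, cost, alpha, gamma)
    return Q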
To guarantee the convergence of the algorithm (6.180)-(6.181) to the optimal Q-factors, some conditions must be satisfied. Chief among them are that all state-control pairs $(i, u)$ must be generated infinitely often within the infinitely long sequence $\{(i_k, u_k)\}$, and that the successor states $j$ must be independently sampled at each occurrence of a given state-control pair. Furthermore, the stepsize $\gamma_k$ should be diminishing to 0 at an appropriate rate, as we will discuss shortly.
Q-Learning and Aggregation
Let us also consider the use of Q-learning in conjunction with aggregation, involving a set $\mathcal{A}$ of aggregate states, disaggregation probabilities $d_{xi}$, and aggregation probabilities $\phi_{jy}$. The Q-factors $\hat Q(x, u)$, $x \in \mathcal{A}$, $u \in U$, of the aggregate problem of Section 6.4.1 are the unique solution of the Q-factor equation
\[
\hat Q(x, u) = \hat g(x, u) + \alpha \sum_{y \in \mathcal{A}} \hat p_{xy}(u) \min_{v \in U} \hat Q(y, v)
= \sum_{i=1}^n d_{xi} \sum_{j=1}^n p_{ij}(u)\Bigl(g(i,u,j) + \alpha \sum_{y \in \mathcal{A}} \phi_{jy} \min_{v \in U} \hat Q(y, v)\Bigr), \qquad (6.182)
\]
[cf. Eq. (6.157)]. We may apply Q-learning to solve this equation. In particular, we generate an infinitely long sequence of pairs $\{(x_k, u_k)\} \subset \mathcal{A} \times U$ according to some probabilistic mechanism. For each $(x_k, u_k)$, we generate an original system state $i_k$ according to the disaggregation probabilities $d_{x_k i}$, and then a successor state $j_k$ according to the probabilities $p_{i_k j}(u_k)$. We finally generate an aggregate system state $y_k$ using the aggregation probabilities $\phi_{j_k y}$. Then the Q-factor of $(x_k, u_k)$ is updated using a stepsize $\gamma_k > 0$ while all other Q-factors are left unchanged [cf. Eqs. (6.180)-(6.181)]:
\[
\hat Q_{k+1}(x, u) = (1-\gamma_k)\, \hat Q_k(x, u) + \gamma_k (F_k \hat Q_k)(x, u), \qquad \forall\ (x, u), \qquad (6.183)
\]
where the vector $F_k \hat Q_k$ is defined by
\[
(F_k \hat Q_k)(x, u) = \begin{cases} g(i_k, u_k, j_k) + \alpha \min_{v \in U} \hat Q_k(y_k, v) & \text{if } (x, u) = (x_k, u_k), \\ \hat Q_k(x, u) & \text{if } (x, u) \ne (x_k, u_k). \end{cases}
\]
Note that the probabilistic mechanism by which the pairs $(x_k, u_k)$ are generated is arbitrary, as long as all possible pairs are generated infinitely often. In practice, one may wish to use the aggregation and disaggregation probabilities, and the Markov chain transition probabilities in an effort to ensure that "important" state-control pairs are not underrepresented in the simulation.
After solving for the Q-factors $\hat Q$, the Q-factors of the original problem are approximated by
\[
\tilde Q(j, v) = \sum_{y \in \mathcal{A}} \phi_{jy}\, \hat Q(y, v), \qquad j = 1,\ldots,n, \quad v \in U. \qquad (6.184)
\]
We recognize this as an approximate representation $\tilde Q$ of the Q-factors of the original problem in terms of basis functions. There is a basis function for each aggregate state $y \in \mathcal{A}$ (the vector $\{\phi_{jy} \mid j = 1,\ldots,n\}$), and the corresponding coefficients that weigh the basis functions are the Q-factors of the aggregate problem $\hat Q(y, v)$, $y \in \mathcal{A}$, $v \in U$ (so we have in effect a lookup table representation with respect to $v$). The optimal cost-to-go function of the original problem is approximated by
\[
\tilde J(j) = \min_{v \in U} \tilde Q(j, v), \qquad j = 1,\ldots,n,
\]
and the corresponding one-step lookahead suboptimal policy is obtained as
\[
\tilde\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha \tilde J(j)\bigr), \qquad i = 1,\ldots,n.
\]
Note that the preceding minimization requires knowledge of the transition probabilities $p_{ij}(u)$, which is unfortunate since a principal motivation of Q-learning is to deal with model-free situations where the transition probabilities are not explicitly known. The alternative is to obtain a suboptimal control at $j$ by minimizing over $v \in U$ the Q-factor $\tilde Q(j, v)$ given by Eq. (6.184). This is less discriminating in the choice of control; for example, in the case of hard aggregation, it applies the same control at all states $j$ that belong to the same aggregate state $y$.
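The approximation (6.184) and the two ways of obtaining a suboptimal policy just discussed may be sketched as follows (my own illustration; Q_hat is the aggregate Q-factor table, Phi the aggregation matrix, and p, g the assumed model arrays, which are needed only for the one-step lookahead variant).

import numpy as np

def tilde_Q(Q_hat, Phi):
    # Eq. (6.184): Q_tilde(j, v) = sum_y phi_{jy} Q_hat(y, v)
    return Phi @ Q_hat

def greedy_with_model(Q_hat, Phi, p, g, alpha):
    J_tilde = tilde_Q(Q_hat, Phi).min(axis=1)             # J_tilde(j)
    Q_model = (p * (g + alpha * J_tilde)).sum(axis=2)     # one-step lookahead values
    return Q_model.argmin(axis=1)                         # mu_tilde(i)

def greedy_model_free(Q_hat, Phi):
    # less discriminating: under hard aggregation the same control is chosen
    # at all states of an aggregate state
    return tilde_Q(Q_hat, Phi).argmin(axis=1)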
6.5.1 Convergence Properties of Q-Learning
We will explain the convergence properties of Q-learning by viewing it as an asynchronous value iteration algorithm, where the expected value in the definition (6.179) of the mapping $F$ is approximated via a form of Monte Carlo averaging. In the process we will derive some variants of Q-learning that may offer computational advantages in some situations.$^\dagger$

$\dagger$ Much of the theory of Q-learning can be generalized to problems with post-decision states, where $p_{ij}(u)$ is of the form $q\bigl(f(i,u), j\bigr)$ (cf. Section 6.1.4). In particular, for such problems one may develop similar asynchronous simulation-based versions of value iteration for computing the optimal cost-to-go function $V^*$ of the post-decision states $m = f(i, u)$: the mapping $F$ of Eq. (6.179) is replaced by the mapping $H$ given by
\[
(HV)(m) = \sum_{j=1}^n q(m, j) \min_{u \in U(j)} \bigl[g(j, u) + \alpha V\bigl(f(j, u)\bigr)\bigr], \qquad \forall\ m,
\]
[cf. Eq. (6.11)]. Q-learning corresponds to the special case where $f(i, u) = (i, u)$.
In particular we will relate the Q-learning algorithm (6.180)-(6.181) to an (idealized) value iteration-type algorithm, which is defined by the same infinitely long sequence $\{(i_k, u_k)\}$, and is given by
\[
Q_{k+1}(i, u) = \begin{cases} (FQ_k)(i_k, u_k) & \text{if } (i, u) = (i_k, u_k), \\ Q_k(i, u) & \text{if } (i, u) \ne (i_k, u_k), \end{cases} \qquad (6.185)
\]
where $F$ is the mapping (6.179). Compared to the Q-learning algorithm (6.180)-(6.181), we note that this algorithm:

(a) Also updates at iteration $k$ only the Q-factor corresponding to the pair $(i_k, u_k)$, while it leaves all the other Q-factors unchanged.

(b) Involves the mapping $F$ in place of $F_k$, and a stepsize equal to 1 instead of $\gamma_k$.

We can view the algorithm (6.185) as a special case of an asynchronous value iteration algorithm of the type discussed in Section 1.3. Using the analysis of Gauss-Seidel value iteration and related methods given in that section, it can be shown that the algorithm (6.185) converges to the optimal Q-factor vector provided all state-control pairs $(i, u)$ are generated infinitely often within the sequence $\{(i_k, u_k)\}$.$^\dagger$
Suppose now that we replace the expected value in the definition (6.179) of $F$ with a Monte Carlo estimate based on all the samples up to time $k$ that involve $(i_k, u_k)$. Letting $n_k$ be the number of times the current state-control pair $(i_k, u_k)$ has been generated up to and including time $k$, and
\[
T_k = \bigl\{ t \mid (i_t, u_t) = (i_k, u_k),\ 0 \le t \le k \bigr\}
\]
be the set of corresponding time indexes, we obtain the following algorithm:
\[
Q_{k+1}(i, u) = (\tilde F_k Q_k)(i, u), \qquad \forall\ (i, u), \qquad (6.186)
\]
where $\tilde F_k Q_k$ is defined by
\[
(\tilde F_k Q_k)(i_k, u_k) = \frac{1}{n_k} \sum_{t \in T_k} \Bigl(g(i_t, u_t, j_t) + \alpha \min_{v \in U(j_t)} Q_k(j_t, v)\Bigr), \qquad (6.187)
\]
$\dagger$ Generally, iteration with a mapping that is either a contraction with respect to a weighted sup-norm, or has some monotonicity properties and a fixed point, converges when executed asynchronously (i.e., with different frequencies for different components, and with iterates that are not up-to-date). One or both of these properties are present in discounted and stochastic shortest path problems. As a result, there are strong asynchronous convergence guarantees for value iteration for such problems, as shown in [Ber82]. A general convergence theory of distributed asynchronous algorithms was developed in [Ber83] and has formed the basis for the convergence analysis of totally asynchronous algorithms in the book [BeT89].
\[
(\tilde F_k Q_k)(i, u) = Q_k(i, u), \qquad \forall\ (i, u) \ne (i_k, u_k). \qquad (6.188)
\]
Comparing the preceding equations and Eq. (6.179), and using the law of large numbers, it is clear that for each $(i, u)$, we have with probability 1
\[
\lim_{k\to\infty,\ k \in T(i,u)} (\tilde F_k Q_k)(i, u) = (FQ_k)(i, u),
\]
where $T(i, u) = \bigl\{k \mid (i_k, u_k) = (i, u)\bigr\}$. From this, the sup-norm contraction property of $F$, and the attendant asynchronous convergence properties of the value iteration algorithm $Q := FQ$, it can be shown that the algorithm (6.186)-(6.188) converges with probability 1 to the optimal Q-factors [assuming again that all state-control pairs $(i, u)$ are generated infinitely often within the sequence $\{(i_k, u_k)\}$].

From the point of view of convergence rate, the algorithm (6.186)-(6.188) is quite satisfactory, but unfortunately it may have a significant drawback: it requires excessive overhead per iteration to calculate the Monte Carlo estimate $(\tilde F_k Q_k)(i_k, u_k)$ using Eq. (6.187). In particular, while the term
\[
\frac{1}{n_k} \sum_{t \in T_k} g(i_t, u_t, j_t)
\]
in this equation can be recursively updated with minimal overhead, the term
\[
\frac{1}{n_k} \sum_{t \in T_k} \min_{v \in U(j_t)} Q_k(j_t, v) \qquad (6.189)
\]
must be completely recomputed at each iteration $k$, using the current vector $Q_k$. This may be impractical, since the above summation may potentially involve a very large number of terms.$^\dagger$
$\dagger$ We note a special type of problem where the overhead involved in updating the term (6.189) may be manageable. This is the case where for each pair $(i, u)$ the set $S(i, u)$ of possible successor states $j$ [the ones with $p_{ij}(u) > 0$] has small cardinality. Then for each $(i, u)$, we may maintain the numbers of times that each successor state $j \in S(i, u)$ has occurred up to time $k$, and use them to compute efficiently the troublesome term (6.189). In particular, we may implement the algorithm (6.186)-(6.188) as
\[
Q_{k+1}(i_k, u_k) = \sum_{j \in S(i_k, u_k)} \frac{n_k(j)}{n_k} \Bigl(g(i_k, u_k, j) + \alpha \min_{v \in U(j)} Q_k(j, v)\Bigr), \qquad (6.190)
\]
where $n_k(j)$ is the number of times the transition $(i_k, j)$, $j \in S(i_k, u_k)$, occurred at state $i_k$ under $u_k$ in the simulation up to time $k$, i.e., $n_k(j)$ is the cardinality of the set
\[
\{ j_t = j \mid t \in T_k \}, \qquad j \in S(i_k, u_k).
\]
Note that this amounts to replacing the probabilities $p_{i_k j}(u_k)$ in the mapping (6.179) with their Monte Carlo estimates $n_k(j)/n_k$. While the minimization term in Eq. (6.190), $\min_{v \in U(j)} Q_k(j, v)$, has to be computed for all $j \in S(i_k, u_k)$ [rather than for just $j_k$ as in the Q-learning algorithm (6.180)-(6.181)], the extra computation is not excessive if the cardinalities of $S(i_k, u_k)$ and $U(j)$, $j \in S(i_k, u_k)$, are small. This approach can be strengthened if some of the probabilities $p_{i_k j}(u_k)$ are known, in which case they can be used directly in Eq. (6.190). Generally, any estimate of $p_{i_k j}(u_k)$ can be used in place of $n_k(j)/n_k$, as long as the estimate converges to $p_{i_k j}(u_k)$ as $k \to \infty$.

A potentially more effective algorithm is to introduce a window of size $m \ge 0$, and consider a more general scheme that calculates the last $m$ terms of the sum in Eq. (6.189) exactly and the remaining terms according to the approximation (6.191). This algorithm, a variant of Q-learning, replaces the offending term (6.189) by
\[
\frac{1}{n_k} \Bigl( \sum_{t \in T_k,\ t \le k-m} \min_{v \in U(j_t)} Q_{t+m}(j_t, v) + \sum_{t \in T_k,\ t > k-m} \min_{v \in U(j_t)} Q_k(j_t, v) \Bigr), \qquad (6.194)
\]
which may also be updated recursively. The algorithm updates at time $k$ the values of $\min_{v \in U(j_t)} Q(j_t, v)$ to $\min_{v \in U(j_t)} Q_k(j_t, v)$ for all $t \in T_k$ within the window $k - m \le t \le k$, and fixes them at the last updated value for $t$ outside this window. For $m = 0$, it reduces to the algorithm (6.192)-(6.193). For moderate values of $m$ it involves moderate additional overhead, and it is likely a more accurate approximation to the term (6.189) than the term (6.191) [$\min_{v \in U(j_t)} Q_{t+m}(j_t, v)$ presumably approximates better than $\min_{v \in U(j_t)} Q_t(j_t, v)$ the "correct" term $\min_{v \in U(j_t)} Q_k(j_t, v)$].

Motivated by the preceding concern, let us modify the algorithm and replace the offending term (6.189) in Eq. (6.187) with
\[
\frac{1}{n_k} \sum_{t \in T_k} \min_{v \in U(j_t)} Q_t(j_t, v), \qquad (6.191)
\]
which can be computed recursively, with minimal extra overhead. This is the algorithm (6.186), but with the Monte Carlo average $(\tilde F_k Q_k)(i_k, u_k)$ of Eq. (6.187) approximated by replacing the term (6.189) with the term (6.191), which depends on all the iterates $Q_t$, $t \in T_k$. This algorithm has the form
\[
Q_{k+1}(i_k, u_k) = \frac{1}{n_k} \sum_{t \in T_k} \Bigl(g(i_t, u_t, j_t) + \alpha \min_{v \in U(j_t)} Q_t(j_t, v)\Bigr), \qquad (6.192)
\]
and
\[
Q_{k+1}(i, u) = Q_k(i, u), \qquad \forall\ (i, u) \ne (i_k, u_k). \qquad (6.193)
\]
We now show that this (approximate) value iteration algorithm is essentially the Q-learning algorithm (6.180)-(6.181).
Indeed, let us observe that the iteration (6.192) can be written as
\[
Q_{k+1}(i_k, u_k) = \frac{n_k - 1}{n_k}\, Q_k(i_k, u_k) + \frac{1}{n_k}\Bigl(g(i_k, u_k, j_k) + \alpha \min_{v \in U(j_k)} Q_k(j_k, v)\Bigr),
\]
or
\[
Q_{k+1}(i_k, u_k) = \Bigl(1 - \frac{1}{n_k}\Bigr) Q_k(i_k, u_k) + \frac{1}{n_k} (F_k Q_k)(i_k, u_k),
\]
where $(F_k Q_k)(i_k, u_k)$ is given by the expression (6.181) used in the Q-learning algorithm. Thus the algorithm (6.192)-(6.193) is the Q-learning algorithm (6.180)-(6.181) with a stepsize $\gamma_k = 1/n_k$. It can be similarly shown that the algorithm (6.192)-(6.193), equipped with a stepsize parameter, is equivalent to the Q-learning algorithm with a different stepsize, say
\[
\gamma_k = \frac{\gamma}{n_k},
\]
where $\gamma$ is a positive constant.
The preceding analysis provides a view of Q-learning as an approximation to asynchronous value iteration (updating one component at a time) that uses Monte Carlo sampling in place of the exact expected value in the mapping $F$ of Eq. (6.179). It also justifies the use of a diminishing stepsize that goes to 0 at a rate proportional to $1/n_k$, where $n_k$ is the number of times the pair $(i_k, u_k)$ has been generated up to time $k$. However, it does not constitute a convergence proof because the Monte Carlo estimate used to approximate the expected value in the definition (6.179) of $F$ is accurate only in the limit, if $Q_k$ converges. We refer to Tsitsiklis [Tsi94] for a rigorous proof of convergence of Q-learning, which uses the theoretical machinery of stochastic approximation algorithms.

In practice, despite its theoretical convergence guarantees, Q-learning has some drawbacks, the most important of which is that the number of Q-factors/state-control pairs $(i, u)$ may be excessive. To alleviate this difficulty, we may introduce a state aggregation scheme. Alternatively, we may introduce a linear approximation architecture for the Q-factors, similar to the policy evaluation schemes of Section 6.3. This is the subject of the next subsection.
6.5.2 Q-Learning and Approximate Policy Iteration
We will now consider Q-learning methods with linear Q-factor approximation. As we discussed earlier (cf. Fig. 6.5.1), we may view Q-factors as optimal costs of a certain discounted DP problem, whose states are the state-control pairs $(i, u)$. We may thus apply the TD/approximate policy iteration methods of Section 6.3. For this, we need to introduce a linear parametric architecture $\tilde Q(i, u, r)$,
\[
\tilde Q(i, u, r) = \phi(i, u)' r, \qquad (6.195)
\]
where $\phi(i, u)$ is a feature vector that depends on both state and control. At the typical iteration, given the current policy $\mu$, these methods find an approximate solution $\tilde Q_\mu(i, u, r)$ of the projected equation for the Q-factors corresponding to $\mu$, and then obtain a new policy $\overline\mu$ by
\[
\overline\mu(i) = \arg\min_{u \in U(i)} \tilde Q_\mu(i, u, r).
\]
For example, similar to our discussion in Section 6.3.4, LSTD(0) with a linear parametric architecture of the form (6.195) generates a trajectory $\bigl\{(i_0, u_0), (i_1, u_1), \ldots\bigr\}$ using the current policy $\mu$ [$u_t = \mu(i_t)$], and finds at time $k$ the unique solution of the projected equation [cf. Eq. (6.53)]
\[
\sum_{t=0}^k \phi(i_t, u_t)\, q_{k,t} = 0,
\]
where $q_{k,t}$ are the corresponding TD
\[
q_{k,t} = \phi(i_t, u_t)' r_k - \alpha\, \phi(i_{t+1}, u_{t+1})' r_k - g(i_t, u_t, i_{t+1}), \qquad (6.196)
\]
[cf. Eq. (6.54)]. Also, LSPE(0) is given by [cf. Eq. (6.71)]
\[
r_{k+1} = r_k - \frac{\gamma}{k+1}\, G_k \sum_{t=0}^k \phi(i_t, u_t)\, q_{k,t}, \qquad (6.197)
\]
where $\gamma$ is a positive stepsize, and $G_k$ is a positive definite matrix, such as
\[
G_k = \Bigl( \frac{\beta}{k+1} I + \frac{1}{k+1} \sum_{t=0}^k \phi(i_t, u_t)\, \phi(i_t, u_t)' \Bigr)^{-1},
\]
with $\beta > 0$, or a diagonal approximation thereof.

There are also optimistic approximate policy iteration methods based on LSPE(0), LSTD(0), and TD(0), similar to the ones we discussed earlier. As an example, let us consider the extreme case of TD(0) that uses a single sample between policy updates. At the start of iteration $k$, we have the current parameter vector $r_k$, we are at some state $i_k$, and we have chosen a control $u_k$. Then:

(1) We simulate the next transition $(i_k, i_{k+1})$ using the transition probabilities $p_{i_k j}(u_k)$.

(2) We generate the control $u_{k+1}$ from the minimization
\[
u_{k+1} = \arg\min_{u \in U(i_{k+1})} \tilde Q(i_{k+1}, u, r_k). \qquad (6.198)
\]
(3) We update the parameter vector via
\[
r_{k+1} = r_k - \gamma_k\, \phi(i_k, u_k)\, q_{k,k}, \qquad (6.199)
\]
where $\gamma_k$ is a positive stepsize, and $q_{k,k}$ is the TD
\[
q_{k,k} = \phi(i_k, u_k)' r_k - \alpha\, \phi(i_{k+1}, u_{k+1})' r_k - g(i_k, u_k, i_{k+1});
\]
[cf. Eq. (6.196)].

The process is now repeated with $r_{k+1}$, $i_{k+1}$, and $u_{k+1}$ replacing $r_k$, $i_k$, and $u_k$, respectively.
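One iteration of the optimistic TD(0) scheme (6.198)-(6.199) may be sketched as follows. The feature map phi(i, u), the simulator sample_next, and the function controls returning $U(i)$ are hypothetical placeholders, and the sketch is mine rather than a prescribed implementation.

import numpy as np

def td0_q_iteration(r, i, u, phi, sample_next, controls, alpha, gamma):
    j, cost = sample_next(i, u)                                   # step (1)
    u_next = min(controls(j), key=lambda v: phi(j, v) @ r)        # step (2), Eq. (6.198)
    # TD of Eq. (6.196): phi(i,u)'r - alpha * phi(j,u_next)'r - g(i,u,j)
    q_td = phi(i, u) @ r - alpha * (phi(j, u_next) @ r) - cost
    r = r - gamma * phi(i, u) * q_td                              # step (3), Eq. (6.199)
    return r, j, u_next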
Exploration
In simulation-based methods, a major concern is the issue of exploration in the approximate policy evaluation step, to ensure that state-control pairs $(i, u) \ne \bigl(i, \mu(i)\bigr)$ are generated sufficiently often in the simulation. For this, the exploration-enhanced schemes discussed in Section 6.3.7 may be used in conjunction with LSTD. As an example, given the current policy $\mu$, we may use any exploration-enhanced transition mechanism to generate a sequence $\bigl\{(i_0, u_0), (i_1, u_1), \ldots\bigr\}$, and then use LSTD(0) with extra transitions
\[
(i_k, u_k) \to \bigl(j_k, \mu(j_k)\bigr),
\]
where $j_k$ is generated from $(i_k, u_k)$ using the transition probabilities $p_{i_k j}(u_k)$ (cf. Section 6.3.7). Because LSTD(0) does not require that the underlying mapping $\Pi T$ be a contraction, we may design a single sequence $\bigl\{(i_k, u_k, j_k)\bigr\}$ that is appropriately exploration-enhanced, and reuse it for all policies generated during policy iteration. This scheme results in substantial economies in simulation overhead. However, it can be used only for $\lambda = 0$, since the simulation samples of multistep policy evaluation methods must depend on the policy.

Alternatively, for $\lambda > 0$, we may use an exploration scheme based on LSTD($\lambda$) with modified temporal differences (cf. Section 6.3.7). In such a scheme, we generate a sequence of state-control pairs $\bigl\{(i_0, u_0), (i_1, u_1), \ldots\bigr\}$ according to transition probabilities
\[
p_{i_k i_{k+1}}(u_k)\, \nu(u_{k+1} \mid i_{k+1}),
\]
where $\nu(u \mid i)$ is a probability distribution over the control constraint set $U(i)$, which provides a mechanism for exploration. Note that in this case the probability ratios in the modified temporal difference of Eq. (6.123) have the form
\[
\frac{p_{i_k i_{k+1}}(u_k)\, \delta\bigl(u_{k+1} = \mu(i_{k+1})\bigr)}{p_{i_k i_{k+1}}(u_k)\, \nu(u_{k+1} \mid i_{k+1})}
= \frac{\delta\bigl(u_{k+1} = \mu(i_{k+1})\bigr)}{\nu(u_{k+1} \mid i_{k+1})},
\]
and do not depend on the transition probabilities $p_{i_k i_{k+1}}(u_k)$. Generally, in the context of Q-learning, the required amount of exploration is likely to be substantial, so the underlying mapping $\Pi T$ may not be a contraction, in which case the validity of LSPE($\lambda$) or TD($\lambda$) comes into doubt (unless $\lambda$ is very close to 1), as discussed in Section 6.3.7.

As in other forms of policy iteration, the behavior of all the algorithms described is very complex, involving for example near-singular matrix inversion (cf. Section 6.3.4) or policy oscillations (cf. Section 6.3.8), and there is no guarantee of success (except for general error bounds for approximate policy iteration methods). However, Q-learning with approximate policy iteration is often tried because of its model-free character [it does not require knowledge of $p_{ij}(u)$].
6.5.3 Q-Learning for Optimal Stopping Problems
The policy evaluation algorithms of Section 6.3, such as TD($\lambda$), LSPE($\lambda$), and LSTD($\lambda$), apply when there is a single policy to be evaluated in the context of approximate policy iteration. We may try to extend these methods to the case of multiple policies, by aiming to solve by simulation the projected equation
\[
\Phi r = \Pi T(\Phi r),
\]
where $T$ is a DP mapping that now involves minimization over multiple controls. However, there are some difficulties:

(a) The mapping $\Pi T$ is nonlinear, so a simulation-based approximation approach like LSTD breaks down.

(b) $\Pi T$ may not in general be a contraction with respect to any norm, so the PVI iteration
\[
\Phi r_{k+1} = \Pi T(\Phi r_k)
\]
[cf. Eq. (6.42)] may diverge and simulation-based LSPE-like approximations may also diverge.

(c) Even if $\Pi T$ is a contraction, so the above PVI iteration converges, the simulation-based LSPE-like approximations may not admit an efficient recursive implementation because $T(\Phi r_k)$ is a nonlinear function of $r_k$.

In this section we discuss the extension of iterative LSPE-type ideas for the special case of an optimal stopping problem where the last two difficulties noted above can be largely overcome. Optimal stopping problems are a special case of DP problems where we can only choose whether to terminate at the current state or not. Examples are problems of search, sequential hypothesis testing, and pricing of derivative financial instruments (see Section 4.4 of Vol. I, and Section 3.4 of the present volume).

We are given a Markov chain with state space $\{1, \ldots, n\}$, described by transition probabilities $p_{ij}$. We assume that the states form a single recurrent class, so that the steady-state distribution vector $\xi = (\xi_1, \ldots, \xi_n)$ satisfies $\xi_i > 0$ for all $i$, as in Section 6.3. Given the current state $i$, we assume that we have two options: to stop and incur a cost $c(i)$, or to continue and incur a cost $g(i, j)$, where $j$ is the next state (there is no control to affect the corresponding transition probabilities). The problem is to minimize the associated $\alpha$-discounted infinite horizon cost.
We associate a Q-factor with each of the two possible decisions. The Q-factor for the decision to stop is equal to $c(i)$. The Q-factor for the decision to continue is denoted by $Q(i)$, and satisfies Bellman's equation
\[
Q(i) = \sum_{j=1}^n p_{ij}\Bigl(g(i, j) + \alpha \min\bigl\{c(j),\, Q(j)\bigr\}\Bigr). \qquad (6.200)
\]
The Q-learning algorithm generates an infinitely long sequence of states $\{i_0, i_1, \ldots\}$, with all states generated infinitely often, and a corresponding sequence of transitions $\bigl\{(i_k, j_k)\bigr\}$, generated according to the transition probabilities $p_{i_k j}$. It updates the Q-factor for the decision to continue as follows [cf. Eqs. (6.180)-(6.181)]:
\[
Q_{k+1}(i) = (1-\gamma_k)\, Q_k(i) + \gamma_k (F_k Q_k)(i), \qquad \forall\ i,
\]
where the components of the mapping $F_k$ are defined by
\[
(F_k Q)(i_k) = g(i_k, j_k) + \alpha \min\bigl\{c(j_k),\, Q(j_k)\bigr\},
\]
and
\[
(F_k Q)(i) = Q(i), \qquad \forall\ i \ne i_k.
\]
The convergence of this algorithm is addressed by the general theory of Q-learning discussed earlier. Once the Q-factors are calculated, an optimal policy can be implemented by stopping at state $i$ if and only if $c(i) \le Q(i)$. However, when the number of states is very large, the algorithm is impractical, which motivates Q-factor approximations.
Let us introduce the mapping $F : \Re^n \to \Re^n$ given by
\[
(FQ)(i) = \sum_{j=1}^n p_{ij}\Bigl(g(i, j) + \alpha \min\bigl\{c(j),\, Q(j)\bigr\}\Bigr).
\]
This mapping can be written in more compact notation as
\[
FQ = g + \alpha P f(Q),
\]
where $g$ is the vector whose $i$th component is
\[
\sum_{j=1}^n p_{ij}\, g(i, j), \qquad (6.201)
\]
and $f(Q)$ is the function whose $j$th component is
\[
f_j(Q) = \min\bigl\{c(j),\, Q(j)\bigr\}. \qquad (6.202)
\]
We note that the (exact) Q-factor for the choice to continue is the unique fixed point of $F$ [cf. Eq. (6.200)].

Let $\|\cdot\|_\xi$ be the weighted Euclidean norm associated with the steady-state probability vector $\xi$. We claim that $F$ is a contraction with respect to this norm. Indeed, for any two vectors $Q$ and $\overline Q$, we have
\[
\bigl|(FQ)(i) - (F\overline Q)(i)\bigr| \le \alpha \sum_{j=1}^n p_{ij}\, \bigl|f_j(Q) - f_j(\overline Q)\bigr| \le \alpha \sum_{j=1}^n p_{ij}\, \bigl|Q(j) - \overline Q(j)\bigr|,
\]
or
\[
|FQ - F\overline Q| \le \alpha P\, |Q - \overline Q|,
\]
where we use the notation $|x|$ to denote a vector whose components are the absolute values of the components of $x$. Hence,
\[
\|FQ - F\overline Q\|_\xi \le \alpha \bigl\| P\, |Q - \overline Q| \bigr\|_\xi \le \alpha \|Q - \overline Q\|_\xi,
\]
where the last step follows from the inequality $\|PJ\|_\xi \le \|J\|_\xi$, which holds for every vector $J$ (cf. Lemma 6.3.1). We conclude that $F$ is a contraction with respect to $\|\cdot\|_\xi$, with modulus $\alpha$.
We will now consider Q-factor approximations, using a linear approximation architecture
\[
\tilde Q(i, r) = \phi(i)' r,
\]
where $\phi(i)$ is an $s$-dimensional feature vector associated with state $i$. We also write the vector
\[
\bigl(\tilde Q(1, r), \ldots, \tilde Q(n, r)\bigr)'
\]
in the compact form $\Phi r$, where as in Section 6.3, $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1,\ldots,n$. We assume that $\Phi$ has rank $s$, and we denote by $\Pi$ the projection mapping with respect to $\|\cdot\|_\xi$ on the subspace
\[
S = \{\Phi r \mid r \in \Re^s\}.
\]
Because $F$ is a contraction with respect to $\|\cdot\|_\xi$ with modulus $\alpha$, and $\Pi$ is nonexpansive, the mapping $\Pi F$ is a contraction with respect to $\|\cdot\|_\xi$ with modulus $\alpha$. Therefore, the algorithm
\[
\Phi r_{k+1} = \Pi F(\Phi r_k) \qquad (6.203)
\]
converges to the unique fixed point of $\Pi F$. This is the analog of the PVI algorithm (cf. Section 6.3.2).

As in Section 6.3.2, we can write the PVI iteration (6.203) as
\[
r_{k+1} = \arg\min_{r \in \Re^s} \bigl\| \Phi r - \bigl(g + \alpha P f(\Phi r_k)\bigr) \bigr\|_\xi^2, \qquad (6.204)
\]
where $g$ and $f$ are defined by Eqs. (6.201) and (6.202). By setting to 0 the gradient of the quadratic function in Eq. (6.204), we see that the iteration is written as
\[
r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} \bigl(C(r_k) - d\bigr),
\]
where
\[
C(r_k) = \Phi' \Xi \bigl(\Phi r_k - \alpha P f(\Phi r_k)\bigr), \qquad d = \Phi' \Xi g.
\]
Similar to Section 6.3.3, we may implement a simulation-based approximate version of this iteration, thereby obtaining an analog of the LSPE(0) method. In particular, we generate a single infinitely long simulated trajectory $(i_0, i_1, \ldots)$ corresponding to an unstopped system, i.e., using the transition probabilities $p_{ij}$. Following the transition $(i_k, i_{k+1})$, we update $r_k$ by
\[
r_{k+1} = r_k - \Bigl( \sum_{t=0}^k \phi(i_t)\, \phi(i_t)' \Bigr)^{-1} \sum_{t=0}^k \phi(i_t)\, q_{k,t}, \qquad (6.205)
\]
where $q_{k,t}$ is the TD,
\[
q_{k,t} = \phi(i_t)' r_k - \alpha \min\bigl\{c(i_{t+1}),\, \phi(i_{t+1})' r_k\bigr\} - g(i_t, i_{t+1}). \qquad (6.206)
\]
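A minimal sketch (my own) of the iteration (6.205)-(6.206). Note that the inner loop recomputes the terms $\min\{c(i_{t+1}), \phi(i_{t+1})'r_k\}$ for all past transitions with the current $r_k$, which is exactly the repartitioning overhead discussed below; the small regularization of the matrix is an addition of mine to keep the early inverses well defined.

import numpy as np

def lspe_optimal_stopping(traj, Phi, c, g, alpha, r0, beta=1e-6):
    """traj: simulated unstopped trajectory i_0, i_1, ...; Phi: feature matrix
    with rows phi(i)'; c: stopping costs; g(i, j): sampled continuation costs."""
    r = r0.copy()
    s = Phi.shape[1]
    B = beta * np.eye(s)              # running sum of phi(i_t) phi(i_t)', regularized
    past = []                         # visited transitions, to recompute the TD sums
    for k in range(len(traj) - 1):
        i, j = traj[k], traj[k + 1]
        B += np.outer(Phi[i], Phi[i])
        past.append((i, j))
        rhs = np.zeros(s)             # sum_t phi(i_t) q_{k,t}, with the current r
        for (it, jt) in past:
            q = Phi[it] @ r - alpha * min(c[jt], Phi[jt] @ r) - g(it, jt)
            rhs += Phi[it] * q
        r = r - np.linalg.solve(B, rhs)    # Eq. (6.205)
    return r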
Similar to the calculations involving the relation between PVI and LSPE, it can be shown that $r_{k+1}$ as given by this iteration is equal to the iterate produced by the iteration $\Phi r_{k+1} = \Pi F(\Phi r_k)$ plus a simulation-induced error that asymptotically converges to 0 with probability 1 (see the paper by Yu and Bertsekas [YuB07], to which we refer for further analysis). As a result, the generated sequence $\{\Phi r_k\}$ asymptotically converges to the unique fixed point of $\Pi F$. Note that similar to discounted problems, we may also use a scaled version of the PVI algorithm,
\[
r_{k+1} = r_k - \gamma G \bigl(C(r_k) - d\bigr), \qquad (6.207)
\]
where $\gamma$ is a positive stepsize, and $G$ is a scaling matrix. If $G$ is positive definite symmetric, it can be shown that this iteration converges to the unique solution of the projected equation if $\gamma$ is sufficiently small. [The proof of this is somewhat more complicated than the corresponding proof of Section 6.3.2 because $C(r_k)$ depends nonlinearly on $r_k$. It requires algorithmic analysis using the theory of variational inequalities; see [Ber09b], [Ber11a].] We may approximate the scaled PVI algorithm (6.207) by a simulation-based scaled LSPE version of the form
\[
r_{k+1} = r_k - \frac{\gamma}{k+1}\, G_k \sum_{t=0}^k \phi(i_t)\, q_{k,t},
\]
where $G_k$ is a positive definite symmetric matrix and $\gamma$ is a sufficiently small positive stepsize. For example, we may use a diagonal approximation to the inverse in Eq. (6.205).
In comparing the Q-learning iteration (6.205)-(6.206) with the alternative optimistic LSPE version (6.197), we note that it has considerably higher computation overhead. In the process of updating $r_{k+1}$ via Eq. (6.205), we can compute the matrix $\sum_{t=0}^k \phi(i_t)\phi(i_t)'$ and the vector $\sum_{t=0}^k \phi(i_t)\, q_{k,t}$ iteratively as in the LSPE algorithms of Section 6.3. However, the terms
\[
\min\bigl\{c(i_{t+1}),\, \phi(i_{t+1})' r_k\bigr\}
\]
in the TD formula (6.206) need to be recomputed for all the samples $i_{t+1}$, $t \le k$. Intuitively, this computation corresponds to repartitioning the states into those at which to stop and those at which to continue, based on the current approximate Q-factors $\Phi r_k$. By contrast, in the corresponding optimistic LSPE version (6.197), there is no repartitioning, and these terms are replaced by $w(i_{t+1}, r_k)$, given by
\[
w(i_{t+1}, r_k) = \begin{cases} c(i_{t+1}) & \text{if } t \in T, \\ \phi(i_{t+1})' r_k & \text{if } t \notin T, \end{cases}
\]
where
\[
T = \bigl\{ t \mid c(i_{t+1}) \le \phi(i_{t+1})' r_t \bigr\}
\]
is the set of states to stop based on the approximate Q-factors $\Phi r_t$, calculated at time $t$ (rather than the current time $k$). In particular, the term
\[
\sum_{t=0}^k \phi(i_t) \min\bigl\{c(i_{t+1}),\, \phi(i_{t+1})' r_k\bigr\}
\]
in Eqs. (6.205), (6.206) is replaced by
\[
\sum_{t=0}^k \phi(i_t)\, w(i_{t+1}, r_k) = \sum_{t \le k,\ t \in T} \phi(i_t)\, c(i_{t+1}) + \Bigl( \sum_{t \le k,\ t \notin T} \phi(i_t)\, \phi(i_{t+1})' \Bigr) r_k, \qquad (6.208)
\]
which can be efficiently updated at each time $k$. It can be seen that the optimistic algorithm that uses the expression (6.208) (no repartitioning) can only converge to the same limit as the nonoptimistic version (6.205). However, there is no convergence proof of this algorithm at present.

Another variant of the algorithm, with a more solid theoretical foundation, is obtained by simply replacing the term $\phi(i_{t+1})' r_k$ in the TD formula (6.206) by $\phi(i_{t+1})' r_t$, thereby eliminating the extra overhead for repartitioning. The idea is that for large $k$ and $t$, these two terms are close to each other, so convergence is still maintained. The convergence analysis of this algorithm and some variations is based on the theory of stochastic approximation methods, and is given in the paper by Yu and Bertsekas [YuB07], to which we refer for further discussion.
Constrained Policy Iteration and Optimal Stopping
It is natural in approximate DP to try to exploit whatever prior information is available about $J^*$. In particular, if it is known that $J^*$ belongs to a subset of $\Re^n$, we may try to find an approximation $\Phi r$ that belongs to that subset. This leads to projected equations involving projection on a restricted subset of the approximation subspace $S$. Corresponding analogs of the LSTD and LSPE-type methods for such projected equations involve the solution of linear variational inequalities rather than linear systems of equations. The details of this are beyond our scope, and we refer to [Ber09b], [Ber11a] for a discussion.

In the practically common case where an upper bound of $J^*$ is available, a simple possibility is to modify the policy iteration algorithm. In particular, suppose that we know a vector $\overline J$ with $\overline J(i) \ge J^*(i)$ for all $i$. Then the approximate policy iteration method can be modified to incorporate this knowledge as follows. Given a policy $\mu$, we evaluate it by finding an approximation $\Phi r_\mu$ to the solution $\tilde J_\mu$ of the equation
\[
\tilde J_\mu(i) = \sum_{j=1}^n p_{ij}\bigl(\mu(i)\bigr)\Bigl(g\bigl(i, \mu(i), j\bigr) + \alpha \min\bigl\{\overline J(j),\, \tilde J_\mu(j)\bigr\}\Bigr), \qquad i = 1,\ldots,n, \qquad (6.209)
\]
followed by the (modified) policy improvement
\[
\overline\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\Bigl(g(i, u, j) + \alpha \min\bigl\{\overline J(j),\, \phi(j)' r_\mu\bigr\}\Bigr), \qquad i = 1,\ldots,n, \qquad (6.210)
\]
where $\phi(j)'$ is the row of $\Phi$ that corresponds to state $j$.

Note that Eq. (6.209) is Bellman's equation for the Q-factor of an optimal stopping problem that involves the stopping cost $\overline J(i)$ at state $i$ [cf. Eq. (6.200)]. Under the assumption $\overline J(i) \ge J^*(i)$ for all $i$, and a lookup table representation ($\Phi = I$), it can be shown that the method (6.209)-(6.210) yields $J^*$ in a finite number of iterations, just like the standard (exact) policy iteration method (Exercise 6.17). When a compact feature-based representation is used ($\Phi \ne I$), the approximate policy evaluation based on Eq. (6.209) can be performed using the Q-learning algorithms described earlier in this section. The method may exhibit oscillatory behavior and is subject to chattering, similar to its unconstrained policy iteration counterpart (cf. Section 6.3.8).
6.5.4 Finite-Horizon Q-Learning
We will now briefly discuss Q-learning and related approximations for finite-horizon problems. We will emphasize on-line algorithms that are suitable for relatively short horizon problems. Such problems are additionally important because they arise in the context of multistep lookahead and rolling horizon schemes, possibly with cost function approximation at the end of the horizon.

One may develop extensions of the Q-learning algorithms of the preceding sections to deal with finite horizon problems, with or without cost function approximation. For example, one may easily develop versions of the projected Bellman equation, and corresponding LSTD and LSPE-type algorithms (see the end-of-chapter exercises). However, with a finite horizon, there are a few alternative approaches, with an on-line character, which resemble rollout algorithms. In particular, at state-time pair $(i_k, k)$, we may compute approximate Q-factors
\[
\tilde Q_k(i_k, u_k), \qquad u_k \in U_k(i_k),
\]
and use on-line the control $u_k \in U_k(i_k)$ that minimizes $\tilde Q_k(i_k, u_k)$ over $u_k \in U_k(i_k)$. The approximate Q-factors have the form
\[
\tilde Q_k(i_k, u_k) = \sum_{i_{k+1}=1}^{n_{k+1}} p_{i_k i_{k+1}}(u_k)\Bigl(g(i_k, u_k, i_{k+1}) + \min_{u_{k+1} \in U_{k+1}(i_{k+1})} \tilde Q_{k+1}(i_{k+1}, u_{k+1})\Bigr), \qquad (6.211)
\]
where $\tilde Q_{k+1}$ may be computed in a number of ways:
(1) $\tilde Q_{k+1}$ may be the cost function $\tilde J_{k+1}$ of a base heuristic (and is thus independent of $u_{k+1}$), in which case Eq. (6.211) takes the form
\[
\tilde Q_k(i_k, u_k) = \sum_{i_{k+1}=1}^{n_{k+1}} p_{i_k i_{k+1}}(u_k)\bigl(g(i_k, u_k, i_{k+1}) + \tilde J_{k+1}(i_{k+1})\bigr). \qquad (6.212)
\]
This is the rollout algorithm discussed at length in Chapter 6 of Vol. I. A variation is when multiple base heuristics are used and $\tilde J_{k+1}$ is the minimum of the cost functions of these heuristics. These schemes may also be combined with a rolling and/or limited lookahead horizon.
(2) $\tilde Q_{k+1}$ is an approximately optimal cost function $\tilde J_{k+1}$ [independent of $u_{k+1}$ as in Eq. (6.212)], which is computed by (possibly multistep lookahead or rolling horizon) DP based on limited sampling to approximate the various expected values arising in the DP algorithm. Thus, here the function $\tilde J_{k+1}$ of Eq. (6.212) corresponds to a (finite-horizon) near-optimal policy in place of the base policy used by rollout. These schemes are well suited for problems with a large (or infinite) state space but only a small number of controls per state, and may also involve selective pruning of the control constraint set to reduce the associated DP computations. The book by Chang, Fu, Hu, and Marcus [CFH07] has extensive discussions of approaches of this type, including systematic forms of adaptive sampling that aim to reduce the effects of limited simulation (less sampling for controls that seem less promising at a given state, and less sampling for future states that are less likely to be visited starting from the current state $i_k$).
(3) $\tilde Q_{k+1}$ is computed using a linear parametric architecture of the form
\[
\tilde Q_{k+1}(i_{k+1}, u_{k+1}) = \phi(i_{k+1}, u_{k+1})' r_{k+1}, \qquad (6.213)
\]
where $r_{k+1}$ is a parameter vector. In particular, $\tilde Q_{k+1}$ may be obtained by a least-squares fit/regression or interpolation based on values computed at a subset of selected state-control pairs (cf. Section 6.4.3 of Vol. I). These values may be computed by finite horizon rollout, using as base policy the greedy policy corresponding to the preceding approximate Q-values in a backwards (off-line) Q-learning scheme:
\[
\tilde\mu_i(x_i) = \arg\min_{u_i \in U_i(x_i)} \tilde Q_i(x_i, u_i), \qquad i = k+2, \ldots, N-1. \qquad (6.214)
\]
Thus, in such a scheme, we first compute
\[
\tilde Q_{N-1}(i_{N-1}, u_{N-1}) = \sum_{i_N=1}^{n_N} p_{i_{N-1} i_N}(u_{N-1})\bigl(g(i_{N-1}, u_{N-1}, i_N) + J_N(i_N)\bigr)
\]
by the final stage DP computation at a subset of selected state-control pairs $(i_{N-1}, u_{N-1})$, followed by a least squares fit of the obtained values to obtain $\tilde Q_{N-1}$ in the form (6.213); then we compute $\tilde Q_{N-2}$ at a subset of selected state-control pairs $(i_{N-2}, u_{N-2})$ by rollout using the base policy $\{\tilde\mu_{N-1}\}$ defined by Eq. (6.214), followed by a least squares fit of the obtained values to obtain $\tilde Q_{N-2}$ in the form (6.213); then compute $\tilde Q_{N-3}$ at a subset of selected state-control pairs $(i_{N-3}, u_{N-3})$ by rollout using the base policy $\{\tilde\mu_{N-2}, \tilde\mu_{N-1}\}$ defined by Eq. (6.214), etc. A sketch of the per-stage least squares fit is given after this list.
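The per-stage least squares fit of Eq. (6.213) in the backwards scheme of item (3) may be sketched as follows. The sample pairs, the feature map phi, and the target values (final-stage DP values, or rollout values under the greedy base policy of Eq. (6.214)) are assumed to be supplied by the surrounding scheme; all names are hypothetical.

import numpy as np

def fit_stage_Q(samples, targets, phi):
    """samples: list of selected state-control pairs (i, u); targets: the
    corresponding Q-values computed by final-stage DP or by rollout.
    Returns the parameter vector r of Eq. (6.213)."""
    A = np.array([phi(i, u) for (i, u) in samples])
    b = np.array(targets)
    r, *_ = np.linalg.lstsq(A, b, rcond=None)
    return r

def greedy_from_fit(r, phi, controls):
    # greedy base policy of Eq. (6.214) for the fitted stage
    return lambda i: min(controls(i), key=lambda u: phi(i, u) @ r)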
One advantage of finite horizon formulations is that convergence issues of the type arising in policy or value iteration methods do not play a significant role, so "anomalous" behavior does not arise. This is, however, a mixed blessing as it may mask poor performance and/or important qualitative differences between alternative approaches.
6.6 STOCHASTIC SHORTEST PATH PROBLEMS
In this section we consider policy evaluation for finite-state stochastic shortest path (SSP) problems (cf. Chapter 2). We assume that there is no discounting ($\alpha = 1$), and that the states are $0, 1, \ldots, n$, where state 0 is a special cost-free termination state. We focus on a fixed proper policy $\mu$, under which all the states $1, \ldots, n$ are transient.

There are natural extensions of the LSTD($\lambda$) and LSPE($\lambda$) algorithms. We introduce a linear approximation architecture of the form
\[
\tilde J(i, r) = \phi(i)' r, \qquad i = 0, 1, \ldots, n,
\]
and the subspace
\[
S = \{\Phi r \mid r \in \Re^s\},
\]
where, as in Section 6.3, $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1,\ldots,n$. We assume that $\Phi$ has rank $s$. Also, for notational convenience in the subsequent formulas, we define $\phi(0) = 0$.

The algorithms use a sequence of simulated trajectories, each of the form $(i_0, i_1, \ldots, i_N)$, where $i_N = 0$, and $i_t \ne 0$ for $t < N$. Once a trajectory is completed, an initial state $i_0$ for the next trajectory is chosen according to a fixed probability distribution $q_0 = \bigl(q_0(1), \ldots, q_0(n)\bigr)$, where
\[
q_0(i) = P(i_0 = i), \qquad i = 1, \ldots, n, \qquad (6.215)
\]
and the process is repeated.
For a trajectory i
0
, i
1
, . . ., of the SSP problem consider the probabil-
ities
q
t
(i) = P(i
t
= i), i = 1, . . . , n, t = 0, 1, . . .
Note that q
t
(i) diminishes to 0 as t at the rate of a geometric pro-
gression (cf. Section 2.1), so the limits
q(i) =

t=0
q
t
(i), i = 1, . . . , n,
are nite. Let q be the vector with components q(1), . . . , q(n). We assume
that q
0
(i) are chosen so that q(i) > 0 for all i [a stronger assumption is
that q
0
(i) > 0 for all i]. We introduce the norm
|J|
q
=

_
n

i=1
q(i)
_
J(i)
_
2
,
and we denote by the projection onto the subspace S with respect to
this norm. In the context of the SSP problem, the projection norm | |
q
Sec. 6.6 Stochastic Shortest Path Problems 459
plays a role similar to the one played by the steady-state distribution norm
| |

for discounted problems (cf. Section 6.3).


Let $P$ be the $n \times n$ matrix with components $p_{ij}$, $i, j = 1, \ldots, n$. Consider also the mapping $T : \Re^n \mapsto \Re^n$ given by
\[ TJ = g + PJ, \]
where $g$ is the vector with components $\sum_{j=0}^n p_{ij}\, g(i, j)$, $i = 1, \ldots, n$. For $\lambda \in [0, 1)$, define the mapping
\[ T^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t T^{t+1} \]
[cf. Eq. (6.83)]. Similar to Section 6.3, we have
\[ T^{(\lambda)} J = P^{(\lambda)} J + (I - \lambda P)^{-1} g, \]
where
\[ P^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t P^{t+1} \qquad (6.216) \]
[cf. Eq. (6.84)].
We will now show that $\Pi T^{(\lambda)}$ is a contraction, so that it has a unique fixed point.

Proposition 6.6.1: For all $\lambda \in [0, 1)$, $\Pi T^{(\lambda)}$ is a contraction with respect to some norm.
Proof: Let $\lambda > 0$. We will show that $T^{(\lambda)}$ is a contraction with respect to the projection norm $\|\cdot\|_q$, so the same is true for $\Pi T^{(\lambda)}$, since $\Pi$ is nonexpansive. Let us first note that with an argument like the one in the proof of Lemma 6.3.1, we can show that
\[ \|PJ\|_q \le \|J\|_q, \qquad J \in \Re^n. \]
Indeed, we have $q = \sum_{t=0}^{\infty} q_t$ and $q_{t+1}' = q_t' P$, so
\[ q' P = \sum_{t=0}^{\infty} q_t' P = \sum_{t=1}^{\infty} q_t' = q' - q_0', \]
or
\[ \sum_{i=1}^n q(i)\, p_{ij} = q(j) - q_0(j), \qquad \forall\ j. \]
Using this relation, we have for all $J \in \Re^n$,
\[ \|PJ\|_q^2 = \sum_{i=1}^n q(i) \Big( \sum_{j=1}^n p_{ij} J(j) \Big)^2 \le \sum_{i=1}^n q(i) \sum_{j=1}^n p_{ij} J(j)^2 = \sum_{j=1}^n J(j)^2 \sum_{i=1}^n q(i) p_{ij} = \sum_{j=1}^n \big( q(j) - q_0(j) \big) J(j)^2 \le \|J\|_q^2. \qquad (6.217) \]
From the relation $\|PJ\|_q \le \|J\|_q$ it follows that
\[ \|P^t J\|_q \le \|J\|_q, \qquad J \in \Re^n, \ t = 0, 1, \ldots \]
Thus, by using the definition (6.216) of $P^{(\lambda)}$, we also have
\[ \|P^{(\lambda)} J\|_q \le \|J\|_q, \qquad J \in \Re^n. \]
Since $\lim_{t\to\infty} P^t J = 0$ for any $J \in \Re^n$, it follows that $\|P^t J\|_q < \|J\|_q$ for all $J \neq 0$ and $t$ sufficiently large. Therefore,
\[ \|P^{(\lambda)} J\|_q < \|J\|_q, \qquad \text{for all } J \neq 0. \qquad (6.218) \]
We now define
\[ \beta = \max\big\{ \|P^{(\lambda)} J\|_q \mid \|J\|_q = 1 \big\} \]
and note that since the maximum in the definition of $\beta$ is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we have $\beta < 1$ in view of Eq. (6.218). Since
\[ \|P^{(\lambda)} J\|_q \le \beta \|J\|_q, \qquad J \in \Re^n, \]
it follows that $P^{(\lambda)}$ is a contraction of modulus $\beta$ with respect to $\|\cdot\|_q$.
Let $\lambda = 0$. We use a different argument because $T$ is not necessarily a contraction with respect to $\|\cdot\|_q$. [An example is given following Prop. 6.8.2. Note also that if $q_0(i) > 0$ for all $i$, from the calculation of Eq. (6.217) it follows that $P$ and hence $T$ is a contraction with respect to $\|\cdot\|_q$.] We show that $\Pi T$ is a contraction with respect to a different norm by showing that the eigenvalues of $\Pi P$ lie strictly within the unit circle.†

Indeed, with an argument like the one used to prove Lemma 6.3.1, we have $\|PJ\|_q \le \|J\|_q$ for all $J$, which implies that $\|\Pi P J\|_q \le \|J\|_q$, so the eigenvalues of $\Pi P$ cannot be outside the unit circle. Assume to arrive at a contradiction that $\nu$ is an eigenvalue of $\Pi P$ with $|\nu| = 1$, and let $\zeta$ be a corresponding eigenvector. We claim that $P\zeta$ must have both real and imaginary components in the subspace $S$. If this were not so, we would have $P\zeta \neq \Pi P\zeta$, so that
\[ \|P\zeta\|_q > \|\Pi P\zeta\|_q = \|\nu\zeta\|_q = |\nu|\,\|\zeta\|_q = \|\zeta\|_q, \]
which contradicts the fact $\|PJ\|_q \le \|J\|_q$ for all $J$. Thus, the real and imaginary components of $P\zeta$ are in $S$, which implies that $P\zeta = \Pi P\zeta = \nu\zeta$, so that $\nu$ is an eigenvalue of $P$. This is a contradiction because $|\nu| = 1$ while the eigenvalues of $P$ are strictly within the unit circle, since the policy being evaluated is proper. Q.E.D.
The preceding proof has shown that $\Pi T^{(\lambda)}$ is a contraction with respect to $\|\cdot\|_q$ when $\lambda > 0$. As a result, similar to Prop. 6.3.5, we can obtain the error bound
\[ \|J_\mu - \Phi r_\lambda^*\|_q \le \frac{1}{\sqrt{1 - \beta_\lambda^2}} \|J_\mu - \Pi J_\mu\|_q, \qquad \lambda > 0, \]
where $\Phi r_\lambda^*$ and $\beta_\lambda$ are the fixed point and contraction modulus of $\Pi T^{(\lambda)}$, respectively. When $\lambda = 0$, we have
\[ \|J_\mu - \Phi r_0^*\| \le \|J_\mu - \Pi J_\mu\| + \|\Pi J_\mu - \Phi r_0^*\| = \|J_\mu - \Pi J_\mu\| + \|\Pi T J_\mu - \Pi T(\Phi r_0^*)\| \le \|J_\mu - \Pi J_\mu\| + \beta_0 \|J_\mu - \Phi r_0^*\|, \]
where $\|\cdot\|$ is the norm with respect to which $\Pi T$ is a contraction (cf. Prop. 6.6.1), and $\Phi r_0^*$ and $\beta_0$ are the fixed point and contraction modulus of $\Pi T$. We thus have the error bound
\[ \|J_\mu - \Phi r_0^*\| \le \frac{1}{1 - \beta_0} \|J_\mu - \Pi J_\mu\|. \]
† We use here the fact that if a square matrix has eigenvalues strictly within the unit circle, then there exists a norm with respect to which the linear mapping defined by the matrix is a contraction. Also in the following argument, the projection $\Pi z$ of a complex vector $z$ is obtained by separately projecting the real and the imaginary components of $z$ on $S$. The projection norm for a complex vector $x + iy$ is defined by
\[ \|x + iy\|_q = \sqrt{ \|x\|_q^2 + \|y\|_q^2 }. \]
Similar to the discounted problem case, the projected equation can be written as a linear equation of the form $Cr = d$. The corresponding LSTD and LSPE algorithms use simulation-based approximations $C_k$ and $d_k$. This simulation generates a sequence of trajectories of the form $(i_0, i_1, \ldots, i_N)$, where $i_N = 0$, and $i_t \neq 0$ for $t < N$. Once a trajectory is completed, an initial state $i_0$ for the next trajectory is chosen according to a fixed probability distribution $q_0 = \big(q_0(1), \ldots, q_0(n)\big)$. The LSTD method approximates the solution $C^{-1}d$ of the projected equation by $C_k^{-1} d_k$, where $C_k$ and $d_k$ are simulation-based approximations to $C$ and $d$, respectively. The LSPE algorithm and its scaled versions are defined by
\[ r_{k+1} = r_k - \gamma G_k (C_k r_k - d_k), \]
where $\gamma$ is a sufficiently small stepsize and $G_k$ is a scaling matrix. The derivation of the detailed equations is straightforward but somewhat tedious, and will not be given (see also the discussion in Section 6.8).
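As an illustration only, here is a minimal sketch of how trajectory-based estimates $C_k$, $d_k$ and the scaled iteration above might look for the case $\lambda = 0$, under the convention $\phi(0) = 0$. The chain, costs, features, scaling matrix, and stepsize are illustrative assumptions, the per-sample formulas are natural LSTD(0)-style estimates rather than the detailed equations omitted above, and the damped iteration is assumed to converge for this instance.

```python
import numpy as np

# Sketch: SSP policy evaluation with lambda = 0 via trajectory simulation.
np.random.seed(1)
n, s = 10, 3
Phi = np.vstack([np.zeros(s), np.random.rand(n, s)])   # row 0 is phi(0) = 0

# A proper policy: from each state move to a random state or terminate at 0.
P = np.random.dirichlet(np.ones(n + 1), size=n + 1)
P[0] = 0.0; P[0, 0] = 1.0                              # state 0 is absorbing
g = np.random.rand(n + 1, n + 1)                       # stage costs g(i, j)
q0 = np.ones(n) / n                                    # restart distribution

def trajectory():
    i = np.random.choice(np.arange(1, n + 1), p=q0)
    traj = [i]
    while i != 0:
        i = np.random.choice(n + 1, p=P[i])
        traj.append(i)
    return traj

C = np.zeros((s, s)); d = np.zeros(s); M = np.zeros((s, s)); count = 0
for _ in range(2000):                                  # accumulate over trajectories
    traj = trajectory()
    for t in range(len(traj) - 1):
        i, j = traj[t], traj[t + 1]
        C += np.outer(Phi[i], Phi[i] - Phi[j])
        d += Phi[i] * g[i, j]
        M += np.outer(Phi[i], Phi[i])
        count += 1
C, d, M = C / count, d / count, M / count

r_lstd = np.linalg.solve(C, d)                         # LSTD estimate C_k^{-1} d_k

G = np.linalg.inv(M)                                   # one possible scaling matrix
r, gamma = np.zeros(s), 0.3                            # small stepsize assumed adequate
for _ in range(300):                                   # scaled LSPE-type iteration
    r = r - gamma * G @ (C @ r - d)

print("LSTD:", np.round(r_lstd, 4), " LSPE:", np.round(r, 4))
```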
Regarding exploration, let us note that the ideas of Sections 6.3.6 and 6.3.7 apply to policy iteration methods for SSP problems. However, because the distribution $q_0$ for the initial state of the simulated trajectories can be chosen arbitrarily, the problem of exploration may be far less acute in SSP problems than in discounted problems, particularly when simulated trajectories tend to be short. In this case, one may explore various parts of the state space naturally through the restart mechanism, similar to the exploration-enhanced LSPE($\lambda$) and LSTD($\lambda$) methods.
6.7 AVERAGE COST PROBLEMS

In this section we consider average cost problems and related approximations: policy evaluation algorithms such as LSTD($\lambda$) and LSPE($\lambda$), approximate policy iteration, and Q-learning. We assume throughout the finite state model of Section 4.1, with the optimal average cost being the same for all initial states (cf. Section 4.2).

6.7.1 Approximate Policy Evaluation

Let us consider the problem of approximate evaluation of a stationary policy $\mu$. As in the discounted case (Section 6.3), we consider a stationary finite-state Markov chain with states $i = 1, \ldots, n$, transition probabilities $p_{ij}$, $i, j = 1, \ldots, n$, and stage costs $g(i, j)$. We assume that the states form a single recurrent class. An equivalent way to express this assumption is the following.
Assumption 6.7.1: The Markov chain has a steady-state probability vector $\xi = (\xi_1, \ldots, \xi_n)$ with positive components, i.e., for all $i = 1, \ldots, n$,
\[ \lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^N P(i_k = j \mid i_0 = i) = \xi_j > 0, \qquad j = 1, \ldots, n. \]
From Section 4.2, we know that under Assumption 6.7.1, the average cost, denoted by $\eta$, is independent of the initial state
\[ \eta = \lim_{N\to\infty} \frac{1}{N} E\Big\{ \sum_{k=0}^{N-1} g\big(x_k, x_{k+1}\big) \,\Big|\, x_0 = i \Big\}, \qquad i = 1, \ldots, n, \qquad (6.219) \]
and satisfies
\[ \eta = \xi' g, \]
where $g$ is the vector whose $i$th component is the expected stage cost $\sum_{j=1}^n p_{ij} g(i, j)$. (In Chapter 4 we denoted the average cost by $\lambda$, but in the present chapter, with apologies to the readers, we reserve $\lambda$ for use in the TD, LSPE, and LSTD algorithms, hence the change in notation.) Together with a differential cost vector $h = \big(h(1), \ldots, h(n)\big)'$, the average cost $\eta$ satisfies Bellman's equation
\[ h(i) = \sum_{j=1}^n p_{ij} \big( g(i, j) - \eta + h(j) \big), \qquad i = 1, \ldots, n. \]
The solution is unique up to a constant shift for the components of $h$, and can be made unique by eliminating one degree of freedom, such as fixing the differential cost of a single state to 0 (cf. Prop. 4.2.4).
We consider a linear architecture for the differential costs of the form
\[ \tilde{h}(i, r) = \phi(i)' r, \qquad i = 1, \ldots, n, \]
where $r \in \Re^s$ is a parameter vector and $\phi(i)$ is a feature vector associated with state $i$. These feature vectors define the subspace
\[ S = \{ \Phi r \mid r \in \Re^s \}, \]
where as in Section 6.3, $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$. We will thus aim to approximate $h$ by a vector in $S$, similar to Section 6.3, which dealt with cost approximation in the discounted case.
We introduce the mapping $F : \Re^n \mapsto \Re^n$ defined by
\[ FJ = g - \eta e + PJ, \]
where $P$ is the transition probability matrix and $e$ is the unit vector. Note that the definition of $F$ uses the exact average cost $\eta$, as given by Eq. (6.219). With this notation, Bellman's equation becomes
\[ h = Fh, \]
so if we know $\eta$, we can try to find or approximate a fixed point of $F$. Similar to Section 6.3, we introduce the projected equation
\[ \Phi r = \Pi F(\Phi r), \]
where $\Pi$ is projection on the subspace $S$ with respect to the norm $\|\cdot\|_\xi$. An important issue is whether $\Pi F$ is a contraction. For this it is necessary to make the following assumption.

Assumption 6.7.2: The columns of the matrix $\Phi$ together with the unit vector $e = (1, \ldots, 1)'$ form a linearly independent set of vectors.

Note the difference with the corresponding Assumption 6.3.2 for the discounted case in Section 6.3. Here, in addition to $\Phi$ having rank $s$, we require that $e$ does not belong to the subspace $S$. To get a sense why this is needed, observe that if $e \in S$, then $\Pi F$ cannot be a contraction, since any scalar multiple of $e$ when added to a fixed point of $\Pi F$ would also be a fixed point.
We also consider multistep versions of the projected equation of the form
\[ \Phi r = \Pi F^{(\lambda)}(\Phi r), \qquad (6.220) \]
where
\[ F^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t F^{t+1}. \]
In matrix notation, the mapping $F^{(\lambda)}$ can be written as
\[ F^{(\lambda)} J = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t P^{t+1} J + \sum_{t=0}^{\infty} \lambda^t P^t (g - \eta e), \]
or more compactly as
\[ F^{(\lambda)} J = P^{(\lambda)} J + (I - \lambda P)^{-1} (g - \eta e), \qquad (6.221) \]
where the matrix $P^{(\lambda)}$ is defined by
\[ P^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t P^{t+1} \qquad (6.222) \]
[cf. Eq. (6.84)]. Note that for $\lambda = 0$, we have $F^{(0)} = F$ and $P^{(0)} = P$.

We wish to delineate conditions under which the mapping $\Pi F^{(\lambda)}$ is a contraction. The following proposition relates to the composition of general linear mappings with Euclidean projections, and captures the essence of our analysis.
Proposition 6.7.1: Let $S$ be a subspace of $\Re^n$ and let $L : \Re^n \mapsto \Re^n$ be a linear mapping,
\[ L(x) = Ax + b, \]
where $A$ is an $n \times n$ matrix and $b$ is a vector in $\Re^n$. Let $\|\cdot\|$ be a weighted Euclidean norm with respect to which $L$ is nonexpansive, and let $\Pi$ denote projection onto $S$ with respect to that norm.

(a) $\Pi L$ has a unique fixed point if and only if either 1 is not an eigenvalue of $A$, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to $S$.

(b) If $\Pi L$ has a unique fixed point, then for all $\gamma \in (0, 1)$, the mapping
\[ H_\gamma = (1 - \gamma) I + \gamma \Pi L \]
is a contraction, i.e., for some scalar $\rho_\gamma \in (0, 1)$, we have
\[ \|H_\gamma x - H_\gamma y\| \le \rho_\gamma \|x - y\|, \qquad \forall\ x, y \in \Re^n. \]
Proof: (a) Assume that $\Pi L$ has a unique fixed point, or equivalently (in view of the linearity of $L$) that 0 is the unique fixed point of $\Pi A$. If 1 is an eigenvalue of $A$ with a corresponding eigenvector $z$ that belongs to $S$, then $Az = z$ and $\Pi A z = \Pi z = z$. Thus, $z$ is a fixed point of $\Pi A$ with $z \neq 0$, a contradiction. Hence, either 1 is not an eigenvalue of $A$, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to $S$.

Conversely, assume that either 1 is not an eigenvalue of $A$, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to $S$. We will show that the mapping $\Pi(I - A)$ is one-to-one from $S$ to $S$, and hence the fixed point of $\Pi L$ is the unique vector $x^* \in S$ satisfying $\Pi(I - A)x^* = \Pi b$. Indeed, assume the contrary, i.e., that $\Pi(I - A)$ has a nontrivial nullspace in $S$, so that some $z \in S$ with $z \neq 0$ is a fixed point of $\Pi A$. Then, either $Az = z$ (which is impossible since then 1 is an eigenvalue of $A$, and $z$ is a corresponding eigenvector that belongs to $S$), or $Az \neq z$, in which case $Az$ differs from its projection $\Pi A z$ and
\[ \|z\| = \|\Pi A z\| < \|Az\| \le \|A\|\,\|z\|, \]
so that $1 < \|A\|$ (which is impossible since $L$ is nonexpansive, and therefore $\|A\| \le 1$), thereby arriving at a contradiction.

(b) If $z \in \Re^n$ with $z \neq 0$ and $z \neq a\Pi A z$ for all $a \ge 0$, we have
\[ \|(1 - \gamma)z + \gamma \Pi A z\| < (1 - \gamma)\|z\| + \gamma\|\Pi A z\| \le (1 - \gamma)\|z\| + \gamma\|z\| = \|z\|, \qquad (6.223) \]
where the strict inequality follows from the strict convexity of the norm, and the weak inequality follows from the nonexpansiveness of $\Pi A$. If on the other hand $z \neq 0$ and $z = a\Pi A z$ for some $a \ge 0$, we have $\|(1 - \gamma)z + \gamma \Pi A z\| < \|z\|$ because then $\Pi L$ has a unique fixed point so $a \neq 1$, and $\Pi A$ is nonexpansive so $a < 1$. If we define
\[ \rho_\gamma = \sup\big\{ \|(1 - \gamma)z + \gamma \Pi A z\| \mid \|z\| \le 1 \big\}, \]
and note that the supremum above is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we see that Eq. (6.223) yields $\rho_\gamma < 1$ and
\[ \|(1 - \gamma)z + \gamma \Pi A z\| \le \rho_\gamma \|z\|, \qquad \forall\ z \in \Re^n. \]
By letting $z = x - y$, with $x, y \in \Re^n$, and by using the definition of $H_\gamma$, we have
\[ H_\gamma x - H_\gamma y = H_\gamma(x - y) = (1 - \gamma)(x - y) + \gamma \Pi A (x - y) = (1 - \gamma)z + \gamma \Pi A z, \]
so by combining the preceding two relations, we obtain
\[ \|H_\gamma x - H_\gamma y\| \le \rho_\gamma \|x - y\|, \qquad \forall\ x, y \in \Re^n. \]
Q.E.D.
We can now derive the conditions under which the mapping underlying the LSPE iteration is a contraction with respect to $\|\cdot\|_\xi$.

Proposition 6.7.2: The mapping
\[ F_{\gamma,\lambda} = (1 - \gamma) I + \gamma \Pi F^{(\lambda)} \]
is a contraction with respect to $\|\cdot\|_\xi$ if one of the following is true:

(i) $\lambda \in (0, 1)$ and $\gamma \in (0, 1]$,

(ii) $\lambda = 0$ and $\gamma \in (0, 1)$.
Proof: Consider first the case, $\gamma = 1$ and $\lambda \in (0, 1)$. Then $\Pi F^{(\lambda)}$ is a linear mapping involving the matrix $\Pi P^{(\lambda)}$. Since $0 < \lambda$ and all states form a single recurrent class, all entries of $P^{(\lambda)}$ are positive. Thus $P^{(\lambda)}$ can be expressed as a convex combination
\[ P^{(\lambda)} = (1 - \beta) I + \beta \bar{P} \]
for some $\beta \in (0, 1)$, where $\bar{P}$ is a stochastic matrix with positive entries. We make the following observations:

(i) $\bar{P}$ corresponds to a nonexpansive mapping with respect to the norm $\|\cdot\|_\xi$. The reason is that the steady-state distribution of $\bar{P}$ is $\xi$ [as can be seen by multiplying the relation $P^{(\lambda)} = (1 - \beta) I + \beta \bar{P}$ with $\xi'$, and by using the relation $\xi' = \xi' P^{(\lambda)}$ to verify that $\xi' = \xi' \bar{P}$]. Thus, we have $\|\bar{P} z\|_\xi \le \|z\|_\xi$ for all $z \in \Re^n$ (cf. Lemma 6.3.1), implying that $\bar{P}$ has the nonexpansiveness property mentioned.

(ii) Since $\bar{P}$ has positive entries, the states of the Markov chain corresponding to $\bar{P}$ form a single recurrent class. If $z$ is an eigenvector of $\bar{P}$ corresponding to the eigenvalue 1, we have $z = \bar{P}^k z$ for all $k \ge 0$, so $z = \bar{P}^* z$, where
\[ \bar{P}^* = \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} \bar{P}^k \]
(cf. Prop. 4.1.2). The rows of $\bar{P}^*$ are all equal to $\xi'$ since the steady-state distribution of $\bar{P}$ is $\xi$, so the equation $z = \bar{P}^* z$ implies that $z$ is a nonzero multiple of $e$. Using Assumption 6.7.2, it follows that $z$ does not belong to the subspace $S$, and from Prop. 6.7.1 (with $\bar{P}$ in place of $A$, and $\beta$ in place of $\gamma$), we see that $\Pi P^{(\lambda)}$ is a contraction with respect to the norm $\|\cdot\|_\xi$. This implies that $\Pi F^{(\lambda)}$ is also a contraction.
Consider next the case, $\gamma \in (0, 1)$ and $\lambda \in (0, 1)$. Since $\Pi F^{(\lambda)}$ is a contraction with respect to $\|\cdot\|_\xi$, as just shown, we have for any $J, \bar{J} \in \Re^n$,
\[ \|F_{\gamma,\lambda} J - F_{\gamma,\lambda} \bar{J}\|_\xi \le (1 - \gamma)\|J - \bar{J}\|_\xi + \gamma \big\| \Pi F^{(\lambda)} J - \Pi F^{(\lambda)} \bar{J} \big\|_\xi \le (1 - \gamma + \gamma\beta) \|J - \bar{J}\|_\xi, \]
where $\beta$ is the contraction modulus of $\Pi F^{(\lambda)}$. Hence, $F_{\gamma,\lambda}$ is a contraction.
Finally, consider the case $\gamma \in (0, 1)$ and $\lambda = 0$. We will show that the mapping $\Pi F$ has a unique fixed point, by showing that either 1 is not an eigenvalue of $P$, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to $S$ [cf. Prop. 6.7.1(a)]. Assume the contrary, i.e., that some $z \in S$ with $z \neq 0$ is an eigenvector corresponding to 1. We then have $z = Pz$. From this it follows that $z = P^k z$ for all $k \ge 0$, so $z = P^* z$, where
\[ P^* = \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} P^k \]
(cf. Prop. 4.1.2). The rows of $P^*$ are all equal to $\xi'$, so the equation $z = P^* z$ implies that $z$ is a nonzero multiple of $e$. Hence, by Assumption 6.7.2, $z$ cannot belong to $S$ - a contradiction. Thus $\Pi F$ has a unique fixed point, and the contraction property of $F_{\gamma,\lambda}$ for $\gamma \in (0, 1)$ and $\lambda = 0$ follows from Prop. 6.7.1(b). Q.E.D.
Error Estimate

We have shown that for each $\lambda \in [0, 1)$, there is a vector $\Phi r_\lambda^*$, the unique fixed point of $F_{\gamma,\lambda}$, $\gamma \in (0, 1)$, which is the limit of LSPE($\lambda$) (cf. Prop. 6.7.2). Let $h$ be any differential cost vector, and let $\beta_{\gamma,\lambda}$ be the modulus of contraction of $F_{\gamma,\lambda}$, with respect to $\|\cdot\|_\xi$. Similar to the proof of Prop. 6.3.2 for the discounted case, we have
\[ \|h - \Phi r_\lambda^*\|_\xi^2 = \|h - \Pi h\|_\xi^2 + \|\Pi h - \Phi r_\lambda^*\|_\xi^2 = \|h - \Pi h\|_\xi^2 + \big\| \Pi F_{\gamma,\lambda} h - \Pi F_{\gamma,\lambda}(\Phi r_\lambda^*) \big\|_\xi^2 \le \|h - \Pi h\|_\xi^2 + \beta_{\gamma,\lambda}^2 \|h - \Phi r_\lambda^*\|_\xi^2. \]
It follows that
\[ \|h - \Phi r_\lambda^*\|_\xi \le \frac{1}{\sqrt{1 - \beta_{\gamma,\lambda}^2}} \|h - \Pi h\|_\xi, \qquad \lambda \in [0, 1), \ \gamma \in (0, 1), \qquad (6.224) \]
for all differential cost vectors $h$.
This estimate is a little peculiar because the differential cost vector is not unique. The set of differential cost vectors is
\[ D = \{ h^* + c e \mid c \in \Re \}, \]
where $h^*$ is the bias of the policy evaluated (cf. Section 4.1, and Props. 4.1.1 and 4.1.2). In particular, $h^*$ is the unique $h \in D$ that satisfies $\xi' h = 0$ or equivalently $P^* h = 0$, where
\[ P^* = \lim_{N\to\infty} \frac{1}{N} \sum_{k=0}^{N-1} P^k. \]
Usually, in average cost policy evaluation, we are interested in obtaining a small error $(h - \Phi r_\lambda^*)$ with the choice of $h$ being immaterial (see the discussion of the next section on approximate policy iteration). It follows that since the estimate (6.224) holds for all $h \in D$, a better error bound can be obtained by using an optimal choice of $h$ in the left-hand side and an optimal choice of $\gamma$ in the right-hand side. Indeed, Tsitsiklis and Van
Roy [TsV99a] have obtained such an optimized error estimate. It has the form
\[ \min_{h \in D} \|h - \Phi r_\lambda^*\|_\xi = \big\| h^* - (I - P^*) \Phi r_\lambda^* \big\|_\xi \le \frac{1}{\sqrt{1 - \hat\beta^2}} \big\| h^* - \hat\Pi h^* \big\|_\xi, \qquad (6.225) \]
where $h^*$ is the bias vector, $\hat\Pi$ denotes projection with respect to $\|\cdot\|_\xi$ onto the subspace
\[ S^* = \big\{ (I - P^*) y \mid y \in S \big\}, \]
and $\hat\beta$ is the minimum over $\gamma \in (0, 1)$ of the contraction modulus of the mapping $\hat\Pi F_{\gamma,\lambda}$:
\[ \hat\beta = \min_{\gamma \in (0,1)} \max_{\|y\|_\xi = 1} \|\hat\Pi P_{\gamma,\lambda}\, y\|_\xi, \]
where $P_{\gamma,\lambda} = (1 - \gamma) I + \gamma P^{(\lambda)}$. Note that this error bound has similar form with the one for discounted problems (cf. Prop. 6.3.5), but $S$ has been replaced by $S^*$ and $\Pi$ has been replaced by $\hat\Pi$. It can be shown that the scalar $\hat\beta$ decreases as $\lambda$ increases, and approaches 0 as $\lambda \to 1$. This is consistent with the corresponding error bound for discounted problems (cf. Prop. 6.3.5), and is also consistent with empirical observations, which suggest that smaller values of $\lambda$ lead to larger approximation errors.

Figure 6.7.1 illustrates and explains the projection operation $\hat\Pi$, the distance of the bias $h^*$ from its projection $\hat\Pi h^*$, and the other terms in the error bound (6.225).
[Figure 6.7.1 depicts the subspace $E^*$ of vectors $(I - P^*)y$, the unit vector $e$, the subspace $S$ spanned by the basis vectors, the set $D$ of differential cost vectors, the bias $h^*$ and its projection $\hat\Pi h^*$, and the vectors $\Phi r_\lambda^*$ and $(I - P^*)\Phi r_\lambda^*$.]

Figure 6.7.1 Illustration of the estimate (6.225). Consider the subspace
\[ E^* = \big\{ (I - P^*) y \mid y \in \Re^n \big\}. \]
Let $\Xi$ be the diagonal matrix with $\xi_1, \ldots, \xi_n$ on the diagonal. Note that:

(a) $E^*$ is the subspace that is orthogonal to the unit vector $e$ in the scaled geometry of the norm $\|\cdot\|_\xi$, in the sense that $e' \Xi z = 0$ for all $z \in E^*$. Indeed we have
\[ e' \Xi (I - P^*) y = 0, \qquad \text{for all } y \in \Re^n, \]
because $e' \Xi = \xi'$ and $\xi'(I - P^*) = 0$ as can be easily verified from the fact that the rows of $P^*$ are all equal to $\xi'$.

(b) Projection onto $E^*$ with respect to the norm $\|\cdot\|_\xi$ is simply multiplication with $(I - P^*)$ (since $P^* y = \xi' y\, e$, so $P^* y$ is orthogonal to $E^*$ in the scaled geometry of the norm $\|\cdot\|_\xi$). Thus, $S^*$ is the projection of $S$ onto $E^*$.

(c) We have $h^* \in E^*$ since $(I - P^*) h^* = h^*$ in view of $P^* h^* = 0$.

(d) The equation
\[ \min_{h \in D} \|h - \Phi r_\lambda^*\|_\xi = \big\| h^* - (I - P^*) \Phi r_\lambda^* \big\|_\xi \]
is geometrically evident from the figure. Also, the term $\|h^* - \hat\Pi h^*\|_\xi$ of the error bound is the minimum possible error given that $h^*$ is approximated with an element of $S^*$.

(e) The estimate (6.225) is the analog of the discounted estimate of Prop. 6.3.5, with $E^*$ playing the role of the entire space, and with the geometry of the problem projected onto $E^*$. Thus, $S^*$ plays the role of $S$, $h^*$ plays the role of $J_\mu$, $(I - P^*)\Phi r_\lambda^*$ plays the role of $\Phi r_\lambda^*$, and $\hat\Pi$ plays the role of $\Pi$. Finally, $\hat\beta$ is the best possible contraction modulus of $\hat\Pi F_{\gamma,\lambda}$ over $\gamma \in (0, 1)$, within $E^*$ (see the paper [TsV99a] for a detailed analysis).

LSTD($\lambda$) and LSPE($\lambda$)

The LSTD($\lambda$) and LSPE($\lambda$) algorithms for average cost are straightforward extensions of the discounted versions, and will only be summarized. The LSTD($\lambda$) algorithm is given by
\[ \hat{r}_k = C_k^{-1} d_k. \]
There is also a regression-based version that is well-suited for cases where $C_k$ is nearly singular (cf. Section 6.3.4). The LSPE($\lambda$) iteration can be written (similar to the discounted case) as
\[ r_{k+1} = r_k - \gamma G_k (C_k r_k - d_k), \qquad (6.226) \]
where $\gamma$ is a positive stepsize and
\[ C_k = \frac{1}{k+1} \sum_{t=0}^k z_t \big( \phi(i_t) - \phi(i_{t+1}) \big)', \qquad G_k = \Big( \frac{1}{k+1} \sum_{t=0}^k \phi(i_t)\phi(i_t)' \Big)^{-1}, \]
\[ d_k = \frac{1}{k+1} \sum_{t=0}^k z_t \big( g(i_t, i_{t+1}) - \eta_t \big), \qquad z_t = \sum_{m=0}^t \lambda^{t-m} \phi(i_m). \]
Scaled versions of this algorithm, where $G_k$ is a scaling matrix, are also possible.

The matrices $C_k$, $G_k$, and vector $d_k$ can be shown to converge to limits:
\[ C_k \to \Phi' \Xi (I - P^{(\lambda)}) \Phi, \qquad G_k \to (\Phi' \Xi \Phi)^{-1}, \qquad d_k \to \Phi' \Xi g^{(\lambda)}, \qquad (6.227) \]
where the matrix $P^{(\lambda)}$ is defined by Eq. (6.222), $g^{(\lambda)}$ is given by
\[ g^{(\lambda)} = \sum_{\ell=0}^{\infty} \lambda^\ell P^\ell (g - \eta e), \]
and $\Xi$ is the diagonal matrix with diagonal entries $\xi_1, \ldots, \xi_n$:
\[ \Xi = \mathrm{diag}(\xi_1, \ldots, \xi_n) \]
[cf. Eqs. (6.84) and (6.85)].
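The following sketch forms the average cost LSTD($\lambda$)/LSPE($\lambda$) quantities $C_k$, $G_k$, $d_k$ with eligibility traces $z_t$ on an illustrative ergodic chain. The chain, costs, and features are assumptions of the example, and the running sample average used for $\eta_t$ in $d_k$ is one plausible choice of average cost estimate, not a prescription from the text.

```python
import numpy as np

# Sketch: average cost LSTD(lambda) / LSPE(lambda) estimates with traces.
np.random.seed(2)
n, s, lam, gamma = 8, 3, 0.7, 0.1
P = np.random.dirichlet(np.ones(n), size=n)     # an (almost surely) irreducible chain
g = np.random.rand(n, n)                        # stage costs g(i, j)
Phi = np.random.rand(n, s)

K = 20000
i = 0
z = np.zeros(s)
C = np.zeros((s, s)); d = np.zeros(s); M = np.zeros((s, s))
cost_sum = 0.0

for t in range(K):
    j = np.random.choice(n, p=P[i])
    cost_sum += g[i, j]
    eta_t = cost_sum / (t + 1)                  # assumed running estimate of eta
    z = lam * z + Phi[i]                        # z_t = sum_m lam^{t-m} phi(i_m)
    C += np.outer(z, Phi[i] - Phi[j])
    d += z * (g[i, j] - eta_t)
    M += np.outer(Phi[i], Phi[i])
    i = j

C_k, d_k, G_k = C / K, d / K, np.linalg.inv(M / K)

r_lstd = np.linalg.solve(C_k, d_k)              # LSTD(lambda): r_k = C_k^{-1} d_k
r = np.zeros(s)
for _ in range(500):                            # LSPE(lambda) iteration (6.226)
    r = r - gamma * G_k @ (C_k @ r - d_k)

print("LSTD(lam):", np.round(r_lstd, 3), " LSPE(lam):", np.round(r, 3))
```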
6.7.2 Approximate Policy Iteration

Let us consider an approximate policy iteration method that involves approximate policy evaluation and approximate policy improvement. We assume that all stationary policies are unichain, and a special state $s$ is recurrent in the Markov chain corresponding to each stationary policy. As in Section 4.3.1, we consider the stochastic shortest path problem obtained by leaving unchanged all transition probabilities $p_{ij}(u)$ for $j \neq s$, by setting all transition probabilities $p_{is}(u)$ to 0, and by introducing an artificial termination state $t$ to which we move from each state $i$ with probability $p_{is}(u)$. The one-stage cost is equal to $g(i, u) - \eta$, where $\eta$ is a scalar parameter. We refer to this stochastic shortest path problem as the $\eta$-SSP.
The method generates a sequence of stationary policies $\mu^k$, a corresponding sequence of gains $\eta_{\mu^k}$, and a sequence of cost vectors $h_k$. We assume that for some $\epsilon > 0$, we have
\[ \max_{i=1,\ldots,n} \big| h_k(i) - h_{\mu^k, \eta_k}(i) \big| \le \epsilon, \qquad k = 0, 1, \ldots, \]
where
\[ \eta_k = \min_{m=0,1,\ldots,k} \eta_{\mu^m}, \]
$h_{\mu^k, \eta_k}(i)$ is the cost-to-go from state $i$ to the reference state $s$ for the $\eta_k$-SSP under policy $\mu^k$, and $\epsilon$ is a positive scalar quantifying the accuracy of evaluation of the cost-to-go function of the $\eta_k$-SSP. Note that we assume exact calculation of the gains $\eta_{\mu^k}$. Note also that we may calculate approximate differential costs $\tilde{h}_k(i, r)$ that depend on a parameter vector $r$ without regard to the reference state $s$. These differential costs may then be replaced by
\[ h_k(i) = \tilde{h}_k(i, r) - \tilde{h}_k(s, r), \qquad i = 1, \ldots, n. \]
We assume that policy improvement is carried out by approximate minimization in the DP mapping. In particular, we assume that there exists a tolerance $\delta > 0$ such that for all $i$ and $k$, $\mu^{k+1}(i)$ attains the minimum in the expression
\[ \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + h_k(j) \big), \]
within a tolerance $\delta > 0$.
We now note that since $\eta_k$ is monotonically nonincreasing and is bounded below by the optimal gain $\eta^*$, it must converge to some scalar $\bar\eta$. Since $\eta_k$ can take only one of the finite number of values $\eta_\mu$ corresponding to the finite number of stationary policies $\mu$, we see that $\eta_k$ must converge finitely to $\bar\eta$; that is, for some $\bar{k}$, we have
\[ \eta_k = \bar\eta, \qquad k \ge \bar{k}. \]
Let $h_{\bar\eta}(s)$ denote the optimal cost-to-go from state $s$ in the $\bar\eta$-SSP. Then, by using Prop. 2.4.1, we have
\[ \limsup_{k\to\infty} \big( h_{\mu^k, \bar\eta}(s) - h_{\bar\eta}(s) \big) \le \frac{n(1 - \rho + n)(\delta + 2\epsilon)}{(1 - \rho)^2}, \qquad (6.228) \]
where
\[ \rho = \max_{i=1,\ldots,n,\ \mu} P\big( i_k \neq s,\ k = 1, \ldots, n \mid i_0 = i, \mu \big), \]
and $i_k$ denotes the state of the system after $k$ stages. On the other hand, as can also be seen from Fig. 6.7.2, the relation $\bar\eta \le \eta_{\mu^k}$ implies that
\[ h_{\mu^k, \bar\eta}(s) \ge h_{\mu^k, \eta_{\mu^k}}(s) = 0. \]
It follows, using also Fig. 6.7.2, that
\[ h_{\mu^k, \bar\eta}(s) - h_{\bar\eta}(s) \ge -h_{\bar\eta}(s) \ge -h_{\mu^*, \bar\eta}(s) = (\bar\eta - \eta^*) N_{\mu^*}, \qquad (6.229) \]
where $\mu^*$ is an optimal policy for the $\eta^*$-SSP (and hence also for the original average cost per stage problem) and $N_{\mu^*}$ is the expected number of stages
to return to state $s$, starting from $s$ and using $\mu^*$. Thus, from Eqs. (6.228) and (6.229), we have
\[ \bar\eta - \eta^* \le \frac{n(1 - \rho + n)(\delta + 2\epsilon)}{N_{\mu^*} (1 - \rho)^2}. \qquad (6.230) \]
This relation provides an estimate on the steady-state error of the approximate policy iteration method.

Figure 6.7.2 Relation of the costs of stationary policies for the $\eta$-SSP in the approximate policy iteration method. Here, $N_\mu$ is the expected number of stages to return to state $s$, starting from $s$ and using $\mu$. Since $\bar\eta \le \eta_{\mu^k}$, we have
\[ h_{\mu^k, \bar\eta}(s) \ge h_{\mu^k, \eta_{\mu^k}}(s) = 0. \]
Furthermore, if $\mu^*$ is an optimal policy for the $\eta^*$-SSP, we have
\[ h_{\bar\eta}(s) \le h_{\mu^*, \bar\eta}(s) = (\eta^* - \bar\eta) N_{\mu^*}. \]
We finally note that optimistic versions of the preceding approximate policy iteration method are harder to implement than their discounted cost counterparts. The reason is our assumption that the gain $\eta_\mu$ of every generated policy $\mu$ is exactly calculated; in an optimistic method the current policy $\mu$ may not remain constant for sufficiently long time to estimate $\eta_\mu$ accurately. One may consider schemes where an optimistic version of policy iteration is used to solve the $\eta$-SSP for a fixed $\eta$. The value of $\eta$ may occasionally be adjusted downward by calculating exactly through simulation the gain $\eta_\mu$ of some of the (presumably most promising) generated policies $\mu$, and by then updating $\eta$ according to $\eta := \min\{\eta, \eta_\mu\}$. An alternative is to approximate the average cost problem with a discounted problem, for which an optimistic version of approximate policy iteration can be readily implemented.
6.7.3 Q-Learning for Average Cost Problems

To derive the appropriate form of the Q-learning algorithm, we form an auxiliary average cost problem by augmenting the original system with one additional state for each possible pair $(i, u)$ with $u \in U(i)$. Thus, the states of the auxiliary problem are those of the original problem, $i = 1, \ldots, n$, together with the additional states $(i, u)$, $i = 1, \ldots, n$, $u \in U(i)$. The probabilistic transition mechanism from an original state $i$ is the same as for the original problem [probability $p_{ij}(u)$ of moving to state $j$], while the probabilistic transition mechanism from a state $(i, u)$ is that we move only to states $j$ of the original problem with corresponding probabilities $p_{ij}(u)$ and costs $g(i, u, j)$.

It can be seen that the auxiliary problem has the same optimal average cost per stage $\eta$ as the original, and that the corresponding Bellman's equation is
\[ \eta + h(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + h(j) \big), \qquad i = 1, \ldots, n, \qquad (6.231) \]
\[ \eta + Q(i, u) = \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + h(j) \big), \qquad i = 1, \ldots, n, \ u \in U(i), \qquad (6.232) \]
where $Q(i, u)$ is the differential cost corresponding to $(i, u)$. Taking the minimum over $u$ in Eq. (6.232) and comparing with Eq. (6.231), we obtain
\[ h(i) = \min_{u \in U(i)} Q(i, u), \qquad i = 1, \ldots, n. \]
Substituting the above form of $h(i)$ in Eq. (6.232), we obtain Bellman's equation in a form that exclusively involves the Q-factors:
\[ \eta + Q(i, u) = \sum_{j=1}^n p_{ij}(u) \Big( g(i, u, j) + \min_{v \in U(j)} Q(j, v) \Big), \qquad i = 1, \ldots, n, \ u \in U(i). \]
Let us now apply to the auxiliary problem the following variant of the relative value iteration
\[ h_{k+1} = T h_k - h_k(s) e, \]
where $s$ is a special state. We then obtain the iteration [cf. Eqs. (6.231) and (6.232)]
\[ h_{k+1}(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + h_k(j) \big) - h_k(s), \qquad i = 1, \ldots, n, \]
\[ Q_{k+1}(i, u) = \sum_{j=1}^n p_{ij}(u) \big( g(i, u, j) + h_k(j) \big) - h_k(s), \qquad i = 1, \ldots, n, \ u \in U(i). \qquad (6.233) \]
From these equations, we have that
\[ h_k(i) = \min_{u \in U(i)} Q_k(i, u), \qquad i = 1, \ldots, n, \]
and by substituting the above form of $h_k$ in Eq. (6.233), we obtain the following relative value iteration for the Q-factors
\[ Q_{k+1}(i, u) = \sum_{j=1}^n p_{ij}(u) \Big( g(i, u, j) + \min_{v \in U(j)} Q_k(j, v) \Big) - \min_{v \in U(s)} Q_k(s, v). \]
The sequence of values $\min_{u \in U(s)} Q_k(s, u)$ is expected to converge to the optimal average cost per stage and the sequences of values $\min_{u \in U(i)} Q_k(i, u)$ are expected to converge to differential costs $h(i)$.

An incremental version of the preceding iteration that involves a positive stepsize $\gamma$ is given by
\[ Q(i, u) := (1 - \gamma) Q(i, u) + \gamma \Big( \sum_{j=1}^n p_{ij}(u) \Big( g(i, u, j) + \min_{v \in U(j)} Q(j, v) \Big) - \min_{v \in U(s)} Q(s, v) \Big). \]
The natural form of the Q-learning method for the average cost problem is an approximate version of this iteration, whereby the expected value is replaced by a single sample, i.e.,
\[ Q(i, u) := Q(i, u) + \gamma \Big( g(i, u, j) + \min_{v \in U(j)} Q(j, v) - \min_{v \in U(s)} Q(s, v) - Q(i, u) \Big), \]
where $j$ and $g(i, u, j)$ are generated from the pair $(i, u)$ by simulation. In this method, only the Q-factor corresponding to the currently sampled pair $(i, u)$ is updated at each iteration, while the remaining Q-factors remain unchanged. Also the stepsize should be diminishing to 0. A convergence analysis of this method can be found in the paper by Abounadi, Bertsekas, and Borkar [ABB01].
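The sketch below applies the single-sample relative value Q-learning update above to a small illustrative problem. The problem data, the uniform sampling of state-control pairs, and the harmonic stepsize rule are assumptions of the example rather than requirements of the method.

```python
import numpy as np

# Sketch: relative value iteration based Q-learning for average cost.
np.random.seed(3)
n, m = 6, 2                                          # states, controls per state
P = np.random.dirichlet(np.ones(n), size=(n, m))     # p_ij(u)
g = np.random.rand(n, m, n)                          # g(i, u, j)
s = 0                                                # special state

Q = np.zeros((n, m))
visits = np.zeros((n, m))

for _ in range(200000):
    i, u = np.random.randint(n), np.random.randint(m)   # sampled pair (i, u)
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]                           # diminishing stepsize
    j = np.random.choice(n, p=P[i, u])                   # simulated transition and cost
    target = g[i, u, j] + Q[j].min() - Q[s].min()
    Q[i, u] += gamma * (target - Q[i, u])

print("estimated average cost (min_u Q(s, u)):", round(Q[s].min(), 4))
print("h(i) estimates, normalized so h(s) = 0:",
      np.round(Q.min(axis=1) - Q[s].min(), 3))
```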
Q-Learning Based on the Contracting Value Iteration

We now consider an alternative Q-learning method, which is based on the contracting value iteration method of Section 4.3. If we apply this method to the auxiliary problem used above, we obtain the following algorithm
\[ h_{k+1}(i) = \min_{u \in U(i)} \Big( \sum_{j=1}^n p_{ij}(u) g(i, u, j) + \sum_{j=1,\, j \neq s}^n p_{ij}(u) h_k(j) \Big) - \eta_k, \qquad (6.234) \]
\[ Q_{k+1}(i, u) = \sum_{j=1}^n p_{ij}(u) g(i, u, j) + \sum_{j=1,\, j \neq s}^n p_{ij}(u) h_k(j) - \eta_k, \qquad (6.235) \]
\[ \eta_{k+1} = \eta_k + \alpha_k h_{k+1}(s). \]
From these equations, we have that
\[ h_k(i) = \min_{u \in U(i)} Q_k(i, u), \]
and by substituting the above form of $h_k$ in Eq. (6.235), we obtain
\[ Q_{k+1}(i, u) = \sum_{j=1}^n p_{ij}(u) g(i, u, j) + \sum_{j=1,\, j \neq s}^n p_{ij}(u) \min_{v \in U(j)} Q_k(j, v) - \eta_k, \]
\[ \eta_{k+1} = \eta_k + \alpha_k \min_{v \in U(s)} Q_{k+1}(s, v). \]
A small-stepsize version of this iteration is given by
\[ Q(i, u) := (1 - \gamma) Q(i, u) + \gamma \Big( \sum_{j=1}^n p_{ij}(u) g(i, u, j) + \sum_{j=1,\, j \neq s}^n p_{ij}(u) \min_{v \in U(j)} Q(j, v) - \eta \Big), \]
\[ \eta := \eta + \alpha \min_{v \in U(s)} Q(s, v), \]
where $\gamma$ and $\alpha$ are positive stepsizes. A natural form of Q-learning based on this iteration is obtained by replacing the expected values by a single sample, i.e.,
\[ Q(i, u) := (1 - \gamma) Q(i, u) + \gamma \Big( g(i, u, j) - \eta + \min_{v \in U(j)} \hat{Q}(j, v) \Big), \qquad (6.236) \]
\[ \eta := \eta + \alpha \min_{v \in U(s)} Q(s, v), \qquad (6.237) \]
where
\[ \hat{Q}(j, v) = \begin{cases} Q(j, v) & \text{if } j \neq s, \\ 0 & \text{otherwise}, \end{cases} \]
and $j$ and $g(i, u, j)$ are generated from the pair $(i, u)$ by simulation. Here the stepsizes $\gamma$ and $\alpha$ should be diminishing, but $\alpha$ should diminish "faster" than $\gamma$; i.e., the ratio of the stepsizes $\alpha/\gamma$ should converge to zero. For example, we may use $\gamma = C/k$ and $\alpha = c/(k \log k)$, where $C$ and $c$ are positive constants and $k$ is the number of iterations performed on the corresponding pair $(i, u)$ or $\eta$, respectively.

The algorithm has two components: the iteration (6.236), which is essentially a Q-learning method that aims to solve the $\eta$-SSP for the current value of $\eta$, and the iteration (6.237), which updates $\eta$ towards its correct value $\eta^*$. However, $\eta$ is updated at a slower rate than $Q$, since the stepsize ratio $\alpha/\gamma$ converges to zero. The effect is that the Q-learning iteration (6.236) is fast enough to keep pace with the slower changing $\eta$-SSP. A convergence analysis of this method can also be found in the paper [ABB01].
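The following is a minimal sketch of the two-time-scale iteration (6.236)-(6.237): $Q$ is updated with stepsize $\gamma$ and $\eta$ with a smaller stepsize $\alpha$, with $\alpha/\gamma \to 0$. The problem data, the uniform sampling of pairs, and the particular stepsize constants are illustrative assumptions.

```python
import numpy as np

# Sketch: Q-learning based on the contracting value iteration (two time scales).
np.random.seed(4)
n, m, s = 6, 2, 0
P = np.random.dirichlet(np.ones(n), size=(n, m))     # p_ij(u)
g = np.random.rand(n, m, n)                          # g(i, u, j)

Q = np.zeros((n, m))
eta = 0.0
visits = np.zeros((n, m))
k_eta = 0

for _ in range(300000):
    i, u = np.random.randint(n), np.random.randint(m)
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]                       # gamma = C / k with C = 1
    j = np.random.choice(n, p=P[i, u])
    Q_hat = 0.0 if j == s else Q[j].min()            # Q_hat(j, .) = 0 if j = s
    # Eq. (6.236): Q-learning step for the eta-SSP at the current eta
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (g[i, u, j] - eta + Q_hat)
    # Eq. (6.237): slower update of eta (alpha diminishes faster than gamma)
    k_eta += 1
    alpha = 1.0 / (k_eta * np.log(k_eta + 2.0))
    eta += alpha * Q[s].min()

print("estimated optimal average cost eta:", round(eta, 4))
```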
6.8 SIMULATION-BASED SOLUTION OF LARGE SYSTEMS

We have focused so far in this chapter on approximating the solution of Bellman equations within a subspace of basis functions in a variety of contexts. We have seen common analytical threads across discounted, SSP, and average cost problems, as well as differences in formulations, implementation details, and associated theoretical results. In this section we will aim for a more general view of simulation-based solution of large systems within which the methods and analysis so far can be understood and extended. The benefit of this analysis is a deeper perspective, and the ability to address more general as well as new problems in DP and beyond.

For most of this section we consider simulation-based methods for solving the linear fixed point equation
\[ x = b + Ax, \]
where $A$ is an $n \times n$ matrix and $b$ is an $n$-dimensional vector, with components denoted $a_{ij}$ and $b_i$, respectively. These methods are divided in two major categories, which are based on distinctly different philosophies and lines of analysis:

(a) Stochastic approximation methods, which have the form
\[ x_{k+1} = (1 - \gamma_k) x_k + \gamma_k (b + A x_k + w_k), \]
where $w_k$ is zero-mean noise. Here the term $b + A x_k + w_k$ may be viewed as a simulation sample of $b + Ax$, and $\gamma_k$ is a diminishing positive stepsize ($\gamma_k \to 0$). These methods were discussed briefly in Section 6.1.6. A prime example within our context is TD($\lambda$), which is a stochastic approximation method for solving the (linear) multistep projected equation $C^{(\lambda)} r = d^{(\lambda)}$ corresponding to evaluation of a single policy (cf. Section 6.3.6). The Q-learning algorithm of Section 6.5.1 is also a stochastic approximation method, but it solves a nonlinear fixed point problem - Bellman's equation for multiple policies.

(b) Monte-Carlo estimation methods, which obtain Monte-Carlo estimates $A_m$ and $b_m$, based on $m$ samples, and use them in place of $A$ and $b$ in various deterministic methods. Thus an approximate fixed point may be obtained by matrix inversion,
\[ \hat{x} = (I - A_m)^{-1} b_m, \]
or iteratively by
\[ x_{k+1} = (1 - \gamma) x_k + \gamma (b_m + A_m x_k), \qquad k = 0, 1, \ldots, \qquad (6.238) \]
where $\gamma$ is a constant positive stepsize. In a variant of the iterative approach the estimates $A_m$ and $b_m$ are updated as the simulation samples are collected, in which case the method (6.238) takes the form
\[ x_{k+1} = (1 - \gamma) x_k + \gamma (b_k + A_k x_k), \qquad k = 0, 1, \ldots. \]
The LSTD-type methods are examples of the matrix inversion approach, while the LSPE-type methods are examples of the iterative approach.

Stochastic approximation methods, generally speaking, tend to be simpler but slower. They are simpler partly because they involve a single vector sample rather than matrix-vector estimates that are based on many samples. They are slower because their iterations involve more noise per iteration (a single sample rather than a Monte-Carlo average), and hence require a diminishing stepsize. Basically, stochastic approximation methods combine the iteration and Monte-Carlo estimation processes, while methods such as Eq. (6.238) separate the two processes to a large extent.

We should also mention that the fixed point problem $x = b + Ax$ may involve approximations or multistep mappings (cf. Section 6.3.6). For example it may result from a projected equation approach or from an aggregation approach.

In this section, we will focus on Monte-Carlo estimation methods where $x$ is approximated within a subspace $S = \{\Phi r \mid r \in \Re^s\}$. In the special case where $\Phi = I$, we obtain lookup table-type methods, where there is no subspace approximation. We start with the projected equation approach, we continue with the related Bellman equation error methods, and finally we consider aggregation approaches. On occasion we discuss various extensions, involving for example nonlinear fixed point problems. We do not provide a rigorous discussion of stochastic approximation methods, as this would require the use of mathematics that are beyond our scope. We refer to the extensive literature on the subject (see the discussion of Section 6.1.6).
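The sketch below contrasts the two Monte-Carlo estimation options of category (b): matrix inversion with estimates $(A_m, b_m)$ and the constant-stepsize iteration (6.238). The "simulation" is mimicked by perturbing $(A, b)$ with noise whose size shrinks with the sample count $m$; the matrix $A$, the noise model, and the stepsize are illustrative assumptions.

```python
import numpy as np

# Sketch: matrix inversion vs. iterative use of Monte Carlo estimates (A_m, b_m).
np.random.seed(5)
n = 5
A = 0.9 * np.random.dirichlet(np.ones(n), size=n)   # spectral radius below 1
b = np.random.rand(n)
x_true = np.linalg.solve(np.eye(n) - A, b)          # fixed point of x = b + A x

m = 10000                                           # nominal number of samples
A_m = A + np.random.randn(n, n) / np.sqrt(m)        # noise level ~ 1/sqrt(m)
b_m = b + np.random.randn(n) / np.sqrt(m)

# Matrix inversion approach (LSTD-like)
x_inv = np.linalg.solve(np.eye(n) - A_m, b_m)

# Iterative approach (6.238) with a constant stepsize (LSPE-like)
gamma, x = 0.8, np.zeros(n)
for _ in range(300):
    x = (1 - gamma) * x + gamma * (b_m + A_m @ x)

print("error (inversion):", np.linalg.norm(x_inv - x_true))
print("error (iterative):", np.linalg.norm(x - x_true))
```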
6.8.1 Projected Equations - Simulation-Based Versions

We first focus on general linear fixed point equations $x = T(x)$, where
\[ T(x) = b + Ax, \qquad (6.239) \]
$A$ is an $n \times n$ matrix, and $b \in \Re^n$ is a vector. We consider approximations of a solution by solving a projected equation
\[ \Phi r = \Pi T(\Phi r) = \Pi(b + A\Phi r), \]
where $\Pi$ denotes projection with respect to a weighted Euclidean norm $\|\cdot\|_\xi$ on a subspace
\[ S = \{ \Phi r \mid r \in \Re^s \}. \]
We assume throughout that the columns of the $n \times s$ matrix $\Phi$ are linearly independent basis functions.

Examples are Bellman's equation for policy evaluation, in which case $A = \alpha P$, where $P$ is a transition matrix (discounted and average cost), or $P$ is a substochastic matrix (row sums less than or equal to 1, as in SSP), and $\alpha = 1$ (SSP and average cost), or $\alpha < 1$ (discounted). Other examples in DP include the semi-Markov problems discussed in Chapter 5. However, for the moment we do not assume the presence of any stochastic structure in $A$. Instead, we assume throughout that $I - \Pi A$ is invertible, so that the projected equation has a unique solution denoted $r^*$.

We will derive extensions of LSTD(0), LSPE(0), and TD(0) methods of Section 6.3 (the latter two under the assumption that $\Pi T$ is a contraction). References [BeY07] and [BeY09], where these methods were first developed, provide extensions of LSTD($\lambda$), LSPE($\lambda$), and TD($\lambda$) for $\lambda \in (0, 1)$; the latter two are convergent when $\Pi T^{(\lambda)}$ is a contraction on $S$, where
\[ T^{(\lambda)} = (1 - \lambda) \sum_{\ell=0}^{\infty} \lambda^\ell T^{\ell+1}, \]
and $T$ has the general form $T(x) = b + Ax$ of Eq. (6.239) (cf. Section 6.3.6).

Even if $\Pi T$ or $T$ are not contractions, we can obtain an error bound that generalizes some of the bounds obtained earlier. We have
\[ x^* - \Phi r^* = x^* - \Pi x^* + \Pi T x^* - \Pi T(\Phi r^*) = x^* - \Pi x^* + \Pi A (x^* - \Phi r^*), \qquad (6.240) \]
from which
\[ x^* - \Phi r^* = (I - \Pi A)^{-1} (x^* - \Pi x^*). \]
Thus, for any norm $\|\cdot\|$ and fixed point $x^*$ of $T$,
\[ \|x^* - \Phi r^*\| \le \big\| (I - \Pi A)^{-1} \big\|\, \|x^* - \Pi x^*\|, \qquad (6.241) \]
so the approximation error $\|x^* - \Phi r^*\|$ is proportional to the distance of the solution $x^*$ from the approximation subspace. If $\Pi T$ is a contraction mapping of modulus $\alpha \in (0, 1)$ with respect to $\|\cdot\|$, from Eq. (6.240), we have
\[ \|x^* - \Phi r^*\| \le \|x^* - \Pi x^*\| + \|\Pi T(x^*) - \Pi T(\Phi r^*)\| \le \|x^* - \Pi x^*\| + \alpha \|x^* - \Phi r^*\|, \]
so that
\[ \|x^* - \Phi r^*\| \le \frac{1}{1 - \alpha} \|x^* - \Pi x^*\|. \qquad (6.242) \]
We first introduce an equivalent form of the projected equation $\Phi r = \Pi(b + A\Phi r)$, which generalizes the matrix form (6.40)-(6.41) for discounted DP problems. Let us assume that the positive probability distribution vector $\xi$ is given. By the definition of projection with respect to $\|\cdot\|_\xi$, the unique solution $r^*$ of this equation satisfies
\[ r^* = \arg\min_{r \in \Re^s} \big\| \Phi r - (b + A\Phi r^*) \big\|_\xi^2. \]
Setting to 0 the gradient with respect to $r$, we obtain the corresponding orthogonality condition
\[ \Phi' \Xi \big( \Phi r^* - (b + A\Phi r^*) \big) = 0, \]
where $\Xi$ is the diagonal matrix with the probabilities $\xi_1, \ldots, \xi_n$ along the diagonal. Equivalently,
\[ C r^* = d, \]
where
\[ C = \Phi' \Xi (I - A) \Phi, \qquad d = \Phi' \Xi b, \qquad (6.243) \]
and $\Xi$ is the diagonal matrix with the components of $\xi$ along the diagonal [cf. Eqs. (6.40)-(6.41)].

We will now develop a simulation-based approximation to the system $C r^* = d$, by using corresponding estimates of $C$ and $d$. We write $C$ and $d$ as expected values with respect to $\xi$:
\[ C = \sum_{i=1}^n \xi_i\, \phi(i) \Big( \phi(i) - \sum_{j=1}^n a_{ij} \phi(j) \Big)', \qquad d = \sum_{i=1}^n \xi_i\, \phi(i) b_i. \qquad (6.244) \]
[Figure 6.8.1 depicts an index sequence $i_0, i_1, \ldots, i_k, i_{k+1}, \ldots$ generated by row sampling according to $\xi$ (a Markov chain $Q$ may be used), together with transitions $(i_0, j_0), (i_1, j_1), \ldots, (i_k, j_k), \ldots$ generated by column sampling according to a Markov chain $P$ with $|A| \le P$.]

Figure 6.8.1 The basic simulation methodology consists of (a) generating a sequence of indices $\{i_0, i_1, \ldots\}$ according to the distribution $\xi$ (a Markov chain $Q$ may be used for this, but this is not a requirement), and (b) generating a sequence of transitions $\{(i_0, j_0), (i_1, j_1), \ldots\}$ using a Markov chain $P$. It is possible that $j_k = i_{k+1}$, but this is not necessary.
As in Section 6.3.3, we approximate these expected values by simulation-obtained sample averages, however, here we do not have a Markov chain structure by which to generate samples. We must therefore design a sampling scheme that can be used to properly approximate the expected values in Eq. (6.244). In the most basic form of such a scheme, we generate a sequence of indices $\{i_0, i_1, \ldots\}$, and a sequence of transitions between indices $\{(i_0, j_0), (i_1, j_1), \ldots\}$. We use any probabilistic mechanism for this, subject to the following two requirements (cf. Fig. 6.8.1):

(1) Row sampling: The sequence $\{i_0, i_1, \ldots\}$ is generated according to the distribution $\xi$, which defines the projection norm $\|\cdot\|_\xi$, in the sense that with probability 1,
\[ \lim_{k\to\infty} \frac{\sum_{t=0}^k \delta(i_t = i)}{k + 1} = \xi_i, \qquad i = 1, \ldots, n, \qquad (6.245) \]
where $\delta(\cdot)$ denotes the indicator function [$\delta(E) = 1$ if the event $E$ has occurred and $\delta(E) = 0$ otherwise].

(2) Column sampling: The sequence $\{(i_0, j_0), (i_1, j_1), \ldots\}$ is generated according to a certain stochastic matrix $P$ with transition probabilities $p_{ij}$ that satisfy
\[ p_{ij} > 0 \quad \text{if} \quad a_{ij} \neq 0, \qquad (6.246) \]
in the sense that with probability 1,
\[ \lim_{k\to\infty} \frac{\sum_{t=0}^k \delta(i_t = i,\, j_t = j)}{\sum_{t=0}^k \delta(i_t = i)} = p_{ij}, \qquad i, j = 1, \ldots, n. \qquad (6.247) \]
At time $k$, we approximate $C$ and $d$ with
\[ C_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t) \Big( \phi(i_t) - \frac{a_{i_t j_t}}{p_{i_t j_t}} \phi(j_t) \Big)', \qquad d_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t) b_{i_t}. \qquad (6.248) \]
To show that this is a valid approximation, similar to the analysis of Section 6.3.3, we count the number of times an index occurs and after collecting terms, we write Eq. (6.248) as
\[ C_k = \sum_{i=1}^n \hat\xi_{i,k}\, \phi(i) \Big( \phi(i) - \sum_{j=1}^n \hat{p}_{ij,k} \frac{a_{ij}}{p_{ij}} \phi(j) \Big)', \qquad d_k = \sum_{i=1}^n \hat\xi_{i,k}\, \phi(i) b_i, \qquad (6.249) \]
where
\[ \hat\xi_{i,k} = \frac{\sum_{t=0}^k \delta(i_t = i)}{k + 1}, \qquad \hat{p}_{ij,k} = \frac{\sum_{t=0}^k \delta(i_t = i,\, j_t = j)}{\sum_{t=0}^k \delta(i_t = i)} \]
(cf. the calculations in Section 6.3.3). In view of the assumption
\[ \hat\xi_{i,k} \to \xi_i, \qquad \hat{p}_{ij,k} \to p_{ij}, \qquad i, j = 1, \ldots, n, \]
[cf. Eqs. (6.245) and (6.247)], by comparing Eqs. (6.244) and (6.249), we see that $C_k \to C$ and $d_k \to d$. Since the solution $r^*$ of the system (6.244) exists and is unique, the same is true for the system (6.249) for all $k$ sufficiently large. Thus, with probability 1, the solution of the system (6.248) converges to $r^*$ as $k \to \infty$.

A comparison of Eqs. (6.244) and (6.249) indicates some considerations for selecting the stochastic matrix $P$. It can be seen that important (e.g., large) components $a_{ij}$ should be simulated more often ($p_{ij}$: large). In particular, if $(i, j)$ is such that $a_{ij} = 0$, there is an incentive to choose $p_{ij} = 0$, since corresponding transitions $(i, j)$ are "wasted" in that they do not contribute to improvement of the approximation of Eq. (6.244) by Eq. (6.249). This suggests that the structure of $P$ should match in some sense the structure of the matrix $A$, to improve the efficiency of the simulation (the number of samples needed for a given level of simulation error variance). On the other hand, the choice of $P$ does not affect the limit of $\Phi \hat{r}_k$, which is the solution $\Phi r^*$ of the projected equation. By contrast, the choice of $\xi$ affects the projection $\Pi$ and hence also $\Phi r^*$.
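The sketch below implements row sampling (via a Markov chain $Q$, whose steady-state vector implicitly plays the role of $\xi$) and column sampling (via a matrix $P$ with $p_{ij} > 0$ wherever $a_{ij} \neq 0$) to form the estimates $C_k$, $d_k$ of Eq. (6.248). The matrices $A$, $b$, $\Phi$, $Q$, and $P$ are illustrative choices, not tied to a particular DP model.

```python
import numpy as np

# Sketch: row/column sampling and the estimates C_k, d_k of Eq. (6.248).
np.random.seed(6)
n, s = 8, 3
A = 0.8 * np.random.dirichlet(np.ones(n), size=n)   # row sums of |A| equal 0.8
b = np.random.rand(n)
Phi = np.random.rand(n, s)

Q = np.random.dirichlet(np.ones(n), size=n)          # row sampling chain (positive entries)
P = A / 0.8                                          # column sampling: p_ij proportional to a_ij

K = 200000
i = 0
C = np.zeros((s, s)); d = np.zeros(s)
for t in range(K):
    j = np.random.choice(n, p=P[i])                  # column sample (i_t, j_t)
    C += np.outer(Phi[i], Phi[i] - (A[i, j] / P[i, j]) * Phi[j])
    d += Phi[i] * b[i]
    i = np.random.choice(n, p=Q[i])                  # row sample i_{t+1} from the chain Q
C_k, d_k = C / K, d / K

r_hat = np.linalg.solve(C_k, d_k)                    # approximates r* = C^{-1} d
print("r_hat:", np.round(r_hat, 4))
```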
Note that there is a lot of flexibility for generating the sequence $\{i_0, i_1, \ldots\}$ and the transition sequence $\{(i_0, j_0), (i_1, j_1), \ldots\}$ to satisfy Eqs. (6.245) and (6.247). For example, to satisfy Eq. (6.245), the indices $i_t$ do not need to be sampled independently according to $\xi$. Instead, it may be convenient to introduce an irreducible Markov chain with transition matrix $Q$, states $1, \ldots, n$, and $\xi$ as its steady-state probability vector, and to start at some state $i_0$ and generate the sequence $\{i_0, i_1, \ldots\}$ as a single infinitely long trajectory of the chain. For the transition sequence, we may optionally let $j_k = i_{k+1}$ for all $k$, in which case $P$ would be identical to $Q$, but in general this is not essential.

Let us discuss two possibilities for constructing a Markov chain with steady-state probability vector $\xi$. The first is useful when a desirable distribution $\xi$ is known up to a normalization constant. Then we can construct such a chain using techniques that are common in Markov chain Monte Carlo (MCMC) methods (see e.g., Liu [Liu01], Rubinstein and Kroese [RuK08]).

The other possibility, which is useful when there is no particularly desirable $\xi$, is to specify first the transition matrix $Q$ of the Markov chain and let $\xi$ be its steady-state probability vector. Then the requirement (6.245) will be satisfied if the Markov chain is irreducible, in which case $\xi$ will be the unique steady-state probability vector of the chain and will have positive components. An important observation is that explicit knowledge of $\xi$ is not required; it is just necessary to know the Markov chain and to be able to simulate its transitions. The approximate DP applications of Sections 6.3, 6.6, and 6.7, where $Q = P$, fall into this context. In the next section, we will discuss favorable methods for constructing the transition matrix $Q$ from $A$, which result in $\Pi T$ being a contraction so that iterative methods are applicable.

Note that multiple simulated sequences can be used to form the equation (6.248). For example, in the Markov chain-based sampling schemes, we can generate multiple infinitely long trajectories of the chain, starting at several different states, and for each trajectory use $j_k = i_{k+1}$ for all $k$. This will work even if the chain has multiple recurrent classes, as long as there are no transient states and at least one trajectory is started from within each recurrent class. Again $\xi$ will be a steady-state probability vector of the chain, and need not be known explicitly. Note also that using multiple trajectories may be interesting even if there is a single recurrent class, for at least two reasons:

(a) The generation of trajectories may be parallelized among multiple processors, resulting in significant speedup.

(b) The empirical frequencies of occurrence of the states may approach the steady-state probabilities more quickly; this is particularly so for large and "stiff" Markov chains.

We finally note that the option of using distinct Markov chains $Q$ and $P$ for row and column sampling is important in the DP/policy iteration context. In particular, by using a distribution $\xi$ that is not associated with $P$, we may resolve the issue of exploration (see Section 6.3.7).
6.8.2 Matrix Inversion and Regression-Type Methods

Given simulation-based estimates $C_k$ and $d_k$ of $C$ and $d$, respectively, we may approximate $r^* = C^{-1} d$ with
\[ \hat{r}_k = C_k^{-1} d_k, \]
in which case we have $\hat{r}_k \to r^*$ with probability 1 (this parallels the LSTD method of Section 6.3.4). An alternative, which is more suitable for the case where $C_k$ is nearly singular, is the regression/regularization-based estimate
\[ \hat{r}_k = (C_k' \Sigma^{-1} C_k + \beta I)^{-1} (C_k' \Sigma^{-1} d_k + \beta \bar{r}), \qquad (6.250) \]
[cf. Eq. (6.58) in Section 6.3.4], where $\bar{r}$ is an a priori estimate of $r^* = C^{-1} d$, $\beta$ is a positive scalar, and $\Sigma$ is some positive definite symmetric matrix. The error estimate given by Prop. 6.3.4 applies to this method. In particular, the error $\|\hat{r}_k - r^*\|$ is bounded by the sum of two terms: one due to simulation error (which is larger when $C$ is nearly singular, and decreases with the amount of sampling used), and the other due to regularization error (which depends on the regularization parameter $\beta$ and the error $\|\bar{r} - r^*\|$); cf. Eq. (6.60).
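The sketch below compares plain inversion with the regularized estimate (6.250) on an illustrative nearly singular instance; the matrices $C_k$, $d_k$, the prior guess $\bar{r}$, the choice $\Sigma = I$, and the value of $\beta$ are all assumptions made for the example.

```python
import numpy as np

# Sketch: regression/regularization-based estimate (6.250) vs. plain inversion.
np.random.seed(7)
s = 4
C = np.diag([1.0, 0.5, 0.1, 1e-3])               # a nearly singular C
r_star = np.array([1.0, -2.0, 0.5, 3.0])
d = C @ r_star

noise = 1e-3
C_k = C + noise * np.random.randn(s, s)          # simulation-based estimates
d_k = d + noise * np.random.randn(s)

r_inv = np.linalg.solve(C_k, d_k)                # plain inversion: sensitive to noise

Sigma_inv = np.eye(s)                            # Sigma = I for simplicity
beta = 1e-2                                      # regularization parameter
r_bar = np.zeros(s)                              # a priori estimate of r*
r_reg = np.linalg.solve(C_k.T @ Sigma_inv @ C_k + beta * np.eye(s),
                        C_k.T @ Sigma_inv @ d_k + beta * r_bar)

print("||r_inv - r*|| =", np.linalg.norm(r_inv - r_star))
print("||r_reg - r*|| =", np.linalg.norm(r_reg - r_star))
```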
To obtain a confidence interval for the error $\|\hat{r}_k - r^*\|$, we view all variables generated by simulation to be random variables on a common probability space. Let $\hat\Sigma_k$ be the covariance of $(d_k - C_k r^*)$, and let
\[ \hat{b}_k = \hat\Sigma_k^{-1/2} (d_k - C_k r^*). \]
Note that $\hat{b}_k$ has covariance equal to the identity. Let $\hat{P}_k$ be the cumulative distribution function of $\|\hat{b}_k\|^2$, and note that
\[ \|\hat{b}_k\| \le \sqrt{\hat{P}_k^{-1}(1 - \theta)} \qquad (6.251) \]
with probability $(1 - \theta)$, where $\hat{P}_k^{-1}(1 - \theta)$ is the threshold value $v$ at which the probability that $\|\hat{b}_k\|^2$ takes value greater than $v$ is $\theta$. We denote by $P(E)$ the probability of an event $E$.
Proposition 6.8.1: We have
\[ P\big( \|\hat{r}_k - r^*\| \le \sigma_k(\Sigma, \beta) \big) \ge 1 - \theta, \]
where
\[ \sigma_k(\Sigma, \beta) = \max_{i=1,\ldots,s} \Big\{ \frac{\lambda_i}{\lambda_i^2 + \beta} \Big\} \big\| \Sigma^{-1/2} \hat\Sigma_k^{1/2} \big\| \sqrt{\hat{P}_k^{-1}(1 - \theta)} + \max_{i=1,\ldots,s} \Big\{ \frac{\beta}{\lambda_i^2 + \beta} \Big\} \|\bar{r} - r^*\|, \qquad (6.252) \]
and $\lambda_1, \ldots, \lambda_s$ are the singular values of $\Sigma^{-1/2} C_k$.
Proof: Let $b_k = \Sigma^{-1/2}(d_k - C_k r^*)$. Following the notation and proof of Prop. 6.3.4, and using the relation $\hat{b}_k = \hat\Sigma_k^{-1/2} \Sigma^{1/2} b_k$, we have
\[ \hat{r}_k - r^* = V(\Lambda^2 + \beta I)^{-1} \Lambda U' b_k + \beta V(\Lambda^2 + \beta I)^{-1} V' (\bar{r} - r^*) = V(\Lambda^2 + \beta I)^{-1} \Lambda U' \Sigma^{-1/2} \hat\Sigma_k^{1/2} \hat{b}_k + \beta V(\Lambda^2 + \beta I)^{-1} V' (\bar{r} - r^*). \]
From this, we similarly obtain
\[ \|\hat{r}_k - r^*\| \le \max_{i=1,\ldots,s} \Big\{ \frac{\lambda_i}{\lambda_i^2 + \beta} \Big\} \big\| \Sigma^{-1/2} \hat\Sigma_k^{1/2} \big\|\, \|\hat{b}_k\| + \max_{i=1,\ldots,s} \Big\{ \frac{\beta}{\lambda_i^2 + \beta} \Big\} \|\bar{r} - r^*\|. \]
Since Eq. (6.251) holds with probability $(1 - \theta)$, the desired result follows. Q.E.D.
Using a form of the central limit theorem, we may assume that for a large number of samples, $\hat{b}_k$ asymptotically becomes a Gaussian random $s$-dimensional vector, so that the random variable
\[ \|\hat{b}_k\|^2 = (d_k - C_k r^*)' \hat\Sigma_k^{-1} (d_k - C_k r^*) \]
can be treated as a chi-square random variable with $s$ degrees of freedom (since the covariance of $\hat{b}_k$ is the identity by definition). Assuming this, the distribution $\hat{P}_k^{-1}(1 - \theta)$ in Eq. (6.252) is approximately equal and may be replaced by $P^{-1}(1 - \theta; s)$, the threshold value $v$ at which the probability that a chi-square random variable with $s$ degrees of freedom takes value greater than $v$ is $\theta$. Thus in a practical application of Prop. 6.8.1, one may replace $\hat{P}_k^{-1}(1 - \theta)$ by $P^{-1}(1 - \theta; s)$, and also replace $\hat\Sigma_k$ with an estimate of the covariance of $(d_k - C_k r^*)$; the other quantities in Eq. (6.252) ($\beta$, $\lambda_i$, $\Sigma$, and $\bar{r}$) are known.
6.8.3 Iterative/LSPE-Type Methods

In this section, we will consider iterative methods for solving the projected equation $Cr = d$ [cf. Eq. (6.244)], using simulation-based estimates $C_k$ and $d_k$. We first consider the fixed point iteration
\[ \Phi r_{k+1} = \Pi T(\Phi r_k), \qquad k = 0, 1, \ldots, \qquad (6.253) \]
which generalizes the PVI method of Section 6.3.2. For this method to be valid and to converge to $r^*$ it is essential that $\Pi T$ is a contraction with respect to some norm. In the next section, we will provide tools for verifying that this is so.

Similar to the analysis of Section 6.3.3, the simulation-based approximation (LSPE analog) is
\[ r_{k+1} = \Big( \sum_{t=0}^k \phi(i_t)\phi(i_t)' \Big)^{-1} \sum_{t=0}^k \phi(i_t) \Big( \frac{a_{i_t j_t}}{p_{i_t j_t}} \phi(j_t)' r_k + b_{i_t} \Big). \qquad (6.254) \]
Here again $\{i_0, i_1, \ldots\}$ is an index sequence and $\{(i_0, j_0), (i_1, j_1), \ldots\}$ is a transition sequence satisfying Eqs. (6.245)-(6.247).
A generalization of this iteration, written in more compact form and introducing scaling with a matrix $G_k$, is given by
\[ r_{k+1} = r_k - \gamma G_k (C_k r_k - d_k), \qquad (6.255) \]
where $C_k$ and $d_k$ are given by Eq. (6.248) [cf. Eq. (6.70)]. As in Section 6.3.4, this iteration can be equivalently written in terms of generalized temporal differences as
\[ r_{k+1} = r_k - \frac{\gamma}{k+1} G_k \sum_{t=0}^k \phi(i_t)\, q_{k,t} \]
where
\[ q_{k,t} = \phi(i_t)' r_k - \frac{a_{i_t j_t}}{p_{i_t j_t}} \phi(j_t)' r_k - b_{i_t} \]
[cf. Eq. (6.71)]. The scaling matrix $G_k$ should converge to an appropriate matrix $G$.

For the scaled LSPE-type method (6.255) to converge to $r^*$, we must have $G_k \to G$, $C_k \to C$, and $G$, $C$, and $\gamma$ must be such that $I - \gamma G C$ is a contraction. Noteworthy special cases where this is so are:
(a) The case of iteration (6.254), where $\gamma = 1$ and
\[ G_k = \Big( \sum_{t=0}^k \phi(i_t)\phi(i_t)' \Big)^{-1}, \]
under the assumption that $\Pi T$ is a contraction. The reason is that this iteration asymptotically becomes the fixed point iteration $\Phi r_{k+1} = \Pi T(\Phi r_k)$ [cf. Eq. (6.253)].

(b) $C$ is positive definite, $G$ is symmetric positive definite, and $\gamma$ is sufficiently small. This case arises in various DP contexts, e.g., the discounted problem where $A = \alpha P$ (cf. Section 6.3).

(c) $C$ is positive definite, $\gamma = 1$, and $G$ has the form
\[ G = (C + \beta I)^{-1}, \]
where $\beta$ is a positive scalar (cf. Section 6.3.4). The corresponding iteration (6.255) takes the form
\[ r_{k+1} = r_k - (C_k + \beta I)^{-1} (C_k r_k - d_k) \]
[cf. Eq. (6.76)]. (A code sketch of this case is given after the list.)

(d) $C$ is invertible, $\gamma = 1$, and $G$ has the form
\[ G = (C' \Sigma^{-1} C + \beta I)^{-1} C' \Sigma^{-1}, \]
where $\Sigma$ is some positive definite symmetric matrix, and $\beta$ is a positive scalar. The corresponding iteration (6.255) takes the form
\[ r_{k+1} = (C_k' \Sigma^{-1} C_k + \beta I)^{-1} (C_k' \Sigma^{-1} d_k + \beta r_k) \]
[cf. Eq. (6.75)]. As shown in Section 6.3.2, the eigenvalues of $GC$ are $\lambda_i/(\lambda_i + \beta)$, where $\lambda_i$ are the eigenvalues of $C' \Sigma^{-1} C$, so $I - GC$ has real eigenvalues in the interval $(0, 1)$. This iteration also works if $C$ is not invertible.
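Here is a minimal sketch of the scaled iteration (6.255) with the scaling of case (c), $G_k = (C_k + \beta I)^{-1}$ and $\gamma = 1$; the positive definite $C$, the noise model for $C_k$, $d_k$, and the value of $\beta$ are illustrative assumptions.

```python
import numpy as np

# Sketch: scaled LSPE-type iteration (6.255) with G_k = (C_k + beta I)^{-1}, gamma = 1.
np.random.seed(8)
s = 4
B = np.random.rand(s, s)
C = B @ B.T + 0.5 * np.eye(s)                 # a positive definite C
r_star = np.random.rand(s)
d = C @ r_star

C_k = C + 1e-3 * np.random.randn(s, s)        # noisy simulation-based estimates
d_k = d + 1e-3 * np.random.randn(s)

beta = 0.1
G_k = np.linalg.inv(C_k + beta * np.eye(s))   # case (c) scaling

r = np.zeros(s)
for k in range(100):
    r = r - G_k @ (C_k @ r - d_k)             # Eq. (6.255) with gamma = 1

print("||r - r*|| =", np.linalg.norm(r - r_star))
```

Since $I - G_k C_k$ has eigenvalues close to $\beta/(\lambda_i + \beta) \in (0, 1)$ when $C_k$ is near the positive definite $C$, the iteration above converges to the solution of the estimated system.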
The Analog of TD(0)

Let us also note the analog of the TD(0) method. It is similar to Eq. (6.255), but uses only the last sample:
\[ r_{k+1} = r_k - \gamma_k\, \phi(i_k)\, q_{k,k}, \]
where the stepsize $\gamma_k$ must be diminishing to 0. It was shown in [BeY07] and [BeY09] that if $\Pi T$ is a contraction on $S$ with respect to $\|\cdot\|_\xi$, then the matrix $C$ of Eq. (6.243) is positive definite, which is what is essentially needed for convergence of the method to the solution of the projected equation $Cr = d$.
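The following sketch implements this single-sample update with a diminishing stepsize. The problem data and the stepsize rule are illustrative; the row sampling chain is taken equal to the column sampling matrix so that $|A| \le 0.8\,Q$, a choice under which $\Pi T$ is a contraction (cf. the conditions derived next) and convergence can be expected.

```python
import numpy as np

# Sketch: the TD(0) analog with generalized temporal differences q_{k,k}.
np.random.seed(9)
n, s = 8, 3
A = 0.8 * np.random.dirichlet(np.ones(n), size=n)   # |A| = 0.8 * (A / 0.8)
b = np.random.rand(n)
Phi = np.random.rand(n, s)
P = A / 0.8                                         # column sampling probabilities
Q = P                                               # row sampling chain; |A| <= 0.8 Q

r = np.zeros(s)
i = 0
for k in range(1, 500001):
    j = np.random.choice(n, p=P[i])
    # generalized temporal difference q_{k,k}
    q = Phi[i] @ r - (A[i, j] / P[i, j]) * (Phi[j] @ r) - b[i]
    r = r - (1.0 / k) * Phi[i] * q                  # diminishing stepsize gamma_k = 1/k
    i = np.random.choice(n, p=Q[i])

print("TD(0)-analog estimate r:", np.round(r, 4))
```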
Contraction Properties

We will now derive conditions for $\Pi T$ to be a contraction, which facilitates the use of the preceding iterative methods. We assume that the index sequence $\{i_0, i_1, \ldots\}$ is generated as an infinitely long trajectory of a Markov chain whose steady-state probability vector is $\xi$. We denote by $Q$ the corresponding transition probability matrix and by $q_{ij}$ the components of $Q$. As discussed earlier, $Q$ may not be the same as $P$, which is used to generate the transition sequence $\{(i_0, j_0), (i_1, j_1), \ldots\}$ to satisfy Eqs. (6.245) and (6.247). It seems hard to guarantee that $\Pi T$ is a contraction mapping, unless $|A| \le Q$ [i.e., $|a_{ij}| \le q_{ij}$ for all $(i, j)$]. The following propositions assume this condition.

Proposition 6.8.2: Assume that $Q$ is irreducible and that $|A| \le Q$. Then $T$ and $\Pi T$ are contraction mappings under any one of the following three conditions:

(1) For some scalar $\alpha \in (0, 1)$, we have $|A| \le \alpha Q$.

(2) There exists an index $i$ such that $|a_{ij}| < q_{ij}$ for all $j = 1, \ldots, n$.

(3) There exists an index $i$ such that $\sum_{j=1}^n |a_{ij}| < 1$.
Proof: For any vector or matrix $X$, we denote by $|X|$ the vector or matrix that has as components the absolute values of the corresponding components of $X$. Let $\xi$ be the steady-state probability vector of $Q$. Assume condition (1). Since $\Pi$ is nonexpansive with respect to $\|\cdot\|_\xi$, it will suffice to show that $A$ is a contraction with respect to $\|\cdot\|_\xi$. We have
\[ |Az| \le |A|\,|z| \le \alpha Q |z|, \qquad \forall\ z \in \Re^n. \qquad (6.256) \]
Using this relation, we obtain
\[ \|Az\|_\xi \le \alpha \big\| Q|z| \big\|_\xi \le \alpha \|z\|_\xi, \qquad \forall\ z \in \Re^n, \qquad (6.257) \]
where the last inequality follows since $\|Qx\|_\xi \le \|x\|_\xi$ for all $x \in \Re^n$ (see Lemma 6.3.1). Thus, $A$ is a contraction with respect to $\|\cdot\|_\xi$ with modulus $\alpha$.

Assume condition (2). Then, in place of Eq. (6.256), we have
\[ |Az| \le |A|\,|z| \le Q|z|, \qquad \forall\ z \in \Re^n, \]
with strict inequality for the row corresponding to $i$ when $z \neq 0$, and in place of Eq. (6.257), we obtain
\[ \|Az\|_\xi < \big\| Q|z| \big\|_\xi \le \|z\|_\xi, \qquad \forall\ z \neq 0. \]
It follows that $A$ is a contraction with respect to $\|\cdot\|_\xi$, with modulus $\max_{\|z\|_\xi \le 1} \|Az\|_\xi$.

Assume condition (3). It will suffice to show that the eigenvalues of $\Pi A$ lie strictly within the unit circle. Let $\hat{Q}$ be the matrix which is identical to $Q$ except for the $i$th row which is identical to the $i$th row of $|A|$. From the irreducibility of $Q$, it follows that for any $i_1 \neq i$ it is possible to find a sequence of nonzero components $\hat{Q}_{i_1 i_2}, \ldots, \hat{Q}_{i_{k-1} i_k}, \hat{Q}_{i_k i}$ that lead from $i_1$ to $i$. Using a well-known result, we have $\hat{Q}^t \to 0$. Since $|A| \le \hat{Q}$, we also have $|A|^t \to 0$, and hence also $A^t \to 0$ (since $|A^t| \le |A|^t$). Thus, all eigenvalues of $A$ are strictly within the unit circle. We next observe that from the proof argument under conditions (1) and (2), we have
\[ \|\Pi A z\|_\xi \le \|z\|_\xi, \qquad \forall\ z \in \Re^n, \]
so the eigenvalues of $\Pi A$ cannot lie outside the unit circle.†

Assume to arrive at a contradiction that $\nu$ is an eigenvalue of $\Pi A$ with $|\nu| = 1$, and let $\zeta$ be a corresponding eigenvector. We claim that $A\zeta$ must have both real and imaginary components in the subspace $S$. If this were not so, we would have $A\zeta \neq \Pi A\zeta$, so that
\[ \|A\zeta\|_\xi > \|\Pi A\zeta\|_\xi = \|\nu\zeta\|_\xi = |\nu|\,\|\zeta\|_\xi = \|\zeta\|_\xi, \]
which contradicts the fact $\|Az\|_\xi \le \|z\|_\xi$ for all $z$, shown earlier. Thus, the real and imaginary components of $A\zeta$ are in $S$, which implies that $A\zeta = \Pi A\zeta = \nu\zeta$, so that $\nu$ is an eigenvalue of $A$. This is a contradiction because $|\nu| = 1$, while the eigenvalues of $A$ are strictly within the unit circle. Q.E.D.
Note that the preceding proof has shown that under conditions (1)
and (2) of Prop. 6.8.2, T and T are contraction mappings with respect
to the specic norm | |

, and that under condition (1), the modulus of


contraction is . Furthermore, Q need not be irreducible under these con-
ditions it is sucient that Q has no transient states (so that it has a
steady-state probability vector with positive components). Under condi-
tion (3), T and T need not be contractions with respect to | |

. For a
counterexample, take a
i,i+1
= 1 for i = 1, . . . , n 1, and a
n,1
= 1/2, with
every other entry of A equal to 0. Take also q
i,i+1
= 1 for i = 1, . . . , n 1,
In the following argument, the projection z of a complex vector z is
obtained by separately projecting the real and the imaginary components of z on
S. The projection norm for a complex vector x +iy is dened by
x +iy

=
_
x
2

+y
2

.
490 Approximate Dynamic Programming Chap. 6
and q
n,1
= 1, with every other entry of Q equal to 0, so
i
= 1/n for all i.
Then for z = (0, 1, . . . , 1)

we have Az = (1, . . . , 1, 0)

and |Az|

= |z|

,
so A is not a contraction with respect to | |

. Taking S to be the entire


space
n
, we see that the same is true for A.
and $q_{n,1} = 1$, with every other entry of $Q$ equal to 0, so $\xi_i = 1/n$ for all $i$. Then for $z = (0, 1, \ldots, 1)'$ we have $Az = (1, \ldots, 1, 0)'$ and $\|Az\|_\xi = \|z\|_\xi$, so $A$ is not a contraction with respect to $\|\cdot\|_\xi$. Taking $S$ to be the entire space $\Re^n$, we see that the same is true for $\Pi A$.

When the row sums of $|A|$ are no greater than one, one can construct $Q$ with $|A| \le Q$ by adding another matrix to $|A|$:
\[ Q = |A| + \mathrm{Diag}(e - |A|e)\, R, \tag{6.258} \]
where $R$ is a transition probability matrix, $e$ is the unit vector that has all components equal to 1, and $\mathrm{Diag}(e - |A|e)$ is the diagonal matrix with $1 - \sum_{m=1}^n |a_{im}|$, $i = 1, \ldots, n$, on the diagonal. Then the row sum deficit of the $i$th row of $A$ is distributed to the columns $j$ according to the fractions $r_{ij}$, the components of $R$.
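As an illustration of this construction, here is a minimal sketch (in Python with NumPy, which the text does not use; the matrices A and R below are arbitrary hypothetical examples) that forms Q from |A| and a given transition matrix R via Eq. (6.258) and checks that Q is a transition probability matrix dominating |A|.

```python
import numpy as np

def build_dominating_chain(A, R):
    """Form Q = |A| + Diag(e - |A|e) R  [cf. Eq. (6.258)].

    A : (n, n) matrix whose rows satisfy sum_j |a_ij| <= 1.
    R : (n, n) transition probability matrix (rows sum to 1).
    Returns a transition probability matrix Q with |A| <= Q componentwise.
    """
    A_abs = np.abs(A)
    deficit = 1.0 - A_abs.sum(axis=1)          # e - |A| e, the row-sum deficits
    assert np.all(deficit >= -1e-12), "row sums of |A| must not exceed 1"
    Q = A_abs + np.diag(deficit) @ R           # distribute each deficit via R
    return Q

# Small hypothetical example
A = np.array([[0.2, -0.3, 0.0],
              [0.0,  0.5, 0.4],
              [0.1,  0.0, 0.6]])
R = np.full((3, 3), 1.0 / 3.0)                 # uniform redistribution
Q = build_dominating_chain(A, R)
assert np.allclose(Q.sum(axis=1), 1.0)         # Q is a transition matrix
assert np.all(Q >= np.abs(A) - 1e-12)          # |A| <= Q componentwise
print(Q)
```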
The next proposition uses different assumptions than Prop. 6.8.2, and applies to cases where there is no special index $i$ such that $\sum_{j=1}^n |a_{ij}| < 1$. In fact $A$ may itself be a transition probability matrix, so that $I - A$ need not be invertible, and the original system may have multiple solutions; see the subsequent Example 6.8.2. The proposition suggests the use of a damped version of the $T$ mapping in various methods (compare with Section 6.7 and the average cost case for $\lambda = 0$).

Proposition 6.8.3: Assume that there are no transient states corresponding to $Q$, that $\xi$ is a steady-state probability vector of $Q$, and that $|A| \le Q$. Assume further that $I - \Pi A$ is invertible. Then the mapping $\Pi T_\gamma$, where
\[ T_\gamma = (1-\gamma) I + \gamma T, \]
is a contraction with respect to $\|\cdot\|_\xi$ for all $\gamma \in (0,1)$.
Proof: The argument of the proof of Prop. 6.8.2 shows that the condition $|A| \le Q$ implies that $A$ is nonexpansive with respect to the norm $\|\cdot\|_\xi$. Furthermore, since $I - \Pi A$ is invertible, we have $z \ne \Pi A z$ for all $z \ne 0$. Hence for all $\gamma \in (0,1)$ and $z \in \Re^n$,
\[ \|(1-\gamma) z + \gamma \Pi A z\|_\xi < (1-\gamma)\|z\|_\xi + \gamma \|\Pi A z\|_\xi \le (1-\gamma)\|z\|_\xi + \gamma \|z\|_\xi = \|z\|_\xi, \tag{6.259} \]
where the strict inequality follows from the strict convexity of the norm, and the weak inequality follows from the nonexpansiveness of $\Pi A$. If we define
\[ \rho_\gamma = \sup\bigl\{ \|(1-\gamma) z + \gamma \Pi A z\|_\xi \ \big|\ \|z\|_\xi \le 1 \bigr\}, \]
and note that the supremum above is attained by Weierstrass' Theorem, we see that Eq. (6.259) yields $\rho_\gamma < 1$ and
\[ \|(1-\gamma) z + \gamma \Pi A z\|_\xi \le \rho_\gamma \|z\|_\xi, \qquad \forall\ z \in \Re^n. \]
From the definition of $T_\gamma$, we have for all $x, y \in \Re^n$,
\[ T_\gamma x - T_\gamma y = (1-\gamma)(x-y) + \gamma (Tx - Ty) = (1-\gamma)(x-y) + \gamma A(x-y), \]
so defining $z = x - y$, and using the preceding two relations and the nonexpansiveness of $\Pi$, we obtain
\[ \|\Pi T_\gamma x - \Pi T_\gamma y\|_\xi = \bigl\|(1-\gamma)\Pi z + \gamma \Pi(Az)\bigr\|_\xi \le \|(1-\gamma) z + \gamma \Pi A z\|_\xi \le \rho_\gamma \|z\|_\xi = \rho_\gamma \|x-y\|_\xi, \]
for all $x, y \in \Re^n$. Q.E.D.
Note that the mappings $\Pi T_\gamma$ and $\Pi T$ have the same fixed points, so under the assumptions of Prop. 6.8.3, there is a unique fixed point $\Phi r^*$ of $\Pi T$. We now discuss examples of choices of $\xi$ and $Q$ in some special cases.
Example 6.8.1 (Discounted DP Problems and Exploration)

Bellman's equation for the cost vector of a stationary policy in an $n$-state discounted DP problem has the form $x = T(x)$, where
\[ T(x) = \alpha P x + g, \]
$g$ is the vector of single-stage costs associated with the $n$ states, $P$ is the transition probability matrix of the associated Markov chain, and $\alpha \in (0,1)$ is the discount factor. If $P$ is irreducible, and $\xi$ is chosen to be its unique steady-state probability vector, the matrix inversion method based on Eq. (6.248) becomes LSTD(0). The methodology of the present section also allows row sampling/state sequence generation using a Markov chain other than $P$, with an attendant change in $\xi$, as discussed in the context of exploration-enhanced methods in Section 6.3.7.
Example 6.8.2 (Undiscounted DP Problems)

Consider the equation $x = Ax + b$, for the case where $A$ is a substochastic matrix ($a_{ij} \ge 0$ for all $i, j$ and $\sum_{j=1}^n a_{ij} \le 1$ for all $i$). Here $1 - \sum_{j=1}^n a_{ij}$ may be viewed as a transition probability from state $i$ to some absorbing state denoted 0. This is Bellman's equation for the cost vector of a stationary policy of a SSP problem. If the policy is proper in the sense that from any state $i \ne 0$ there exists a path of positive probability transitions from $i$ to the absorbing state 0, the matrix
\[ Q = |A| + \mathrm{Diag}(e - |A|e)\, R \]
[cf. Eq. (6.258)] is irreducible, provided $R$ has positive components. As a result, the conditions of Prop. 6.8.2 under condition (2) are satisfied, and $T$ and $\Pi T$ are contractions with respect to $\|\cdot\|_\xi$. It is also possible to use a matrix $R$ whose components are not all positive, as long as $Q$ is irreducible, in which case Prop. 6.8.2 under condition (3) applies (cf. Prop. 6.7.1).

Consider also the equation $x = Ax + b$ for the case where $A$ is an irreducible transition probability matrix, with steady-state probability vector $\xi$. This is related to Bellman's equation for the differential cost vector of a stationary policy of an average cost DP problem involving a Markov chain with transition probability matrix $A$. Then, if the unit vector $e$ is not contained in the subspace $S$ spanned by the basis functions, the matrix $I - \Pi A$ is invertible, as shown in Section 6.7. As a result, Prop. 6.8.3 applies and shows that the mapping $(1-\gamma) I + \gamma \Pi A$ is a contraction with respect to $\|\cdot\|_\xi$ for all $\gamma \in (0,1)$ (cf. Section 6.7, Props. 6.7.1, 6.7.2).
The projected equation methodology of this section applies to general linear fixed point equations, where $A$ need not have a probabilistic structure. A class of such equations where $A$ is a contraction is given in the following example, an important case in the field of numerical methods/scientific computation where iterative methods are used for solving linear equations.

Example 6.8.3 (Weakly Diagonally Dominant Systems)

Consider the solution of the system
\[ Cx = d, \]
where $d \in \Re^n$ and $C$ is an $n \times n$ matrix that is weakly diagonally dominant, i.e., its components satisfy
\[ c_{ii} \ne 0, \qquad \sum_{j \ne i} |c_{ij}| \le |c_{ii}|, \qquad i = 1, \ldots, n. \tag{6.260} \]
By dividing the $i$th row by $c_{ii}$, we obtain the equivalent system $x = Ax + b$, where the components of $A$ and $b$ are
\[ a_{ij} = \begin{cases} 0 & \text{if } i = j, \\ -\,c_{ij}/c_{ii} & \text{if } i \ne j, \end{cases} \qquad b_i = \frac{d_i}{c_{ii}}, \qquad i = 1, \ldots, n. \]
Then, from Eq. (6.260), we have
\[ \sum_{j=1}^{n} |a_{ij}| = \sum_{j \ne i} \frac{|c_{ij}|}{|c_{ii}|} \le 1, \qquad i = 1, \ldots, n, \]
so Props. 6.8.2 and 6.8.3 may be used under the appropriate conditions. In particular, if the matrix $Q$ given by Eq. (6.258) has no transient states and there exists an index $i$ such that $\sum_{j=1}^{n} |a_{ij}| < 1$, Prop. 6.8.2 applies and shows that $\Pi T$ is a contraction.

Alternatively, instead of Eq. (6.260), assume the somewhat more restrictive condition
\[ |1 - c_{ii}| + \sum_{j \ne i} |c_{ij}| \le 1, \qquad i = 1, \ldots, n, \tag{6.261} \]
and consider the equivalent system $x = Ax + b$, where
\[ A = I - C, \qquad b = d. \]
Then, from Eq. (6.261), we have
\[ \sum_{j=1}^{n} |a_{ij}| = |1 - c_{ii}| + \sum_{j \ne i} |c_{ij}| \le 1, \qquad i = 1, \ldots, n, \]
so again Props. 6.8.2 and 6.8.3 apply under appropriate conditions.
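The following minimal sketch (Python/NumPy, not part of the text; the system C, d is a small hypothetical example) carries out the two conversions of this example and checks the row-sum bounds that make Props. 6.8.2 and 6.8.3 applicable.

```python
import numpy as np

def jacobi_form(C, d):
    """First conversion: divide row i by c_ii and move off-diagonal terms
    to the right-hand side, giving x = Ax + b with a_ii = 0 [cf. Eq. (6.260)]."""
    diag = np.diag(C)
    assert np.all(diag != 0.0)
    A = -C / diag[:, None]
    np.fill_diagonal(A, 0.0)
    b = d / diag
    return A, b

def richardson_form(C, d):
    """Second conversion: A = I - C, b = d [cf. Eq. (6.261)]."""
    return np.eye(C.shape[0]) - C, d.copy()

# Hypothetical weakly diagonally dominant system
C = np.array([[ 1.0, -0.3,  0.2],
              [ 0.1,  0.8, -0.4],
              [-0.2,  0.1,  0.9]])
d = np.array([1.0, 2.0, 0.5])

for A, b in (jacobi_form(C, d), richardson_form(C, d)):
    print("row sums of |A|:", np.abs(A).sum(axis=1))   # all <= 1 by Eqs. (6.260)/(6.261)
    x = np.linalg.solve(np.eye(3) - A, b)               # fixed point of x = Ax + b
    print("residual of Cx = d:", np.linalg.norm(C @ x - d))
```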
Let us finally address the question whether it is possible to find $Q$ such that $|A| \le Q$ and the corresponding Markov chain has no transient states or is irreducible. To this end, assume that $\sum_{j=1}^{n} |a_{ij}| \le 1$ for all $i$. If $A$ is itself irreducible, then any $Q$ such that $|A| \le Q$ is also irreducible. Otherwise, consider the set
\[ I = \Bigl\{ i \ \Big|\ \sum_{j=1}^{n} |a_{ij}| < 1 \Bigr\}, \]
and assume that it is nonempty (otherwise the only possibility is $Q = |A|$). Let $\tilde I$ be the set of $i$ such that there exists a sequence of nonzero components $a_{i j_1}, a_{j_1 j_2}, \ldots, a_{j_m \bar i}$ with $\bar i \in I$, and let $\hat I = \{ i \mid i \notin I \cup \tilde I \}$ (we allow here the possibility that $\tilde I$ or $\hat I$ may be empty). Note that the square submatrix of $|A|$ corresponding to $\hat I$ is a transition probability matrix, and that we have $a_{ij} = 0$ for all $i \in \hat I$ and $j \notin \hat I$. Then it can be shown that there exists $Q$ with $|A| \le Q$ and no transient states if and only if the Markov chain corresponding to $\hat I$ has no transient states. Furthermore, there exists an irreducible $Q$ with $|A| \le Q$ if and only if $\hat I$ is empty.
6.8.4 Multistep Methods

We now consider the possibility of replacing $T$ with a multistep mapping that has the same fixed point, such as $T^\ell$ with $\ell > 1$, or $T^{(\lambda)}$ given by
\[ T^{(\lambda)} = (1-\lambda) \sum_{\ell=0}^{\infty} \lambda^\ell T^{\ell+1}, \]
where $\lambda \in (0,1)$. For example, the LSTD($\lambda$), LSPE($\lambda$), and TD($\lambda$) methods for approximate policy evaluation are based on this possibility. The key idea in extending these methods to general linear systems is that the $i$th component $(A^m b)(i)$ of a vector of the form $A^m b$, where $b \in \Re^n$, can be computed by averaging over properly weighted simulation-based sample values.

In multistep methods, it turns out that for technical efficiency reasons it is important to use the same probabilistic mechanism for row and for column sampling. In particular, we generate the index sequence $\{i_0, i_1, \ldots\}$ and the transition sequence $\{(i_0, i_1), (i_1, i_2), \ldots\}$ by using the same irreducible transition matrix $P$, so $\xi$ is the steady-state probability distribution of $P$. We then form the average of $w_{k,m} b_{i_{k+m}}$ over all indices $k$ such that $i_k = i$, where
\[ w_{k,m} = \begin{cases} \dfrac{a_{i_k i_{k+1}}}{p_{i_k i_{k+1}}}\, \dfrac{a_{i_{k+1} i_{k+2}}}{p_{i_{k+1} i_{k+2}}} \cdots \dfrac{a_{i_{k+m-1} i_{k+m}}}{p_{i_{k+m-1} i_{k+m}}} & \text{if } m \ge 1, \\ 1 & \text{if } m = 0. \end{cases} \tag{6.262} \]
We claim that the following is a valid approximation of $(A^m b)(i)$:
\[ (A^m b)(i) \approx \frac{\sum_{k=0}^{t} \delta(i_k = i)\, w_{k,m}\, b_{i_{k+m}}}{\sum_{k=0}^{t} \delta(i_k = i)}. \tag{6.263} \]
The justification is that by the irreducibility of the associated Markov chain, we have
\[ \lim_{t\to\infty} \frac{\sum_{k=0}^{t} \delta(i_k = i,\ i_{k+1} = j_1, \ldots, i_{k+m} = j_m)}{\sum_{k=0}^{t} \delta(i_k = i)} = p_{i j_1} p_{j_1 j_2} \cdots p_{j_{m-1} j_m}, \tag{6.264} \]
and the limit of the right-hand side of Eq. (6.263) can be written as
\[
\begin{aligned}
\lim_{t\to\infty} \frac{\sum_{k=0}^{t} \delta(i_k = i)\, w_{k,m}\, b_{i_{k+m}}}{\sum_{k=0}^{t} \delta(i_k = i)}
&= \lim_{t\to\infty} \frac{\sum_{k=0}^{t} \sum_{j_1=1}^{n} \cdots \sum_{j_m=1}^{n} \delta(i_k = i, i_{k+1} = j_1, \ldots, i_{k+m} = j_m)\, w_{k,m}\, b_{i_{k+m}}}{\sum_{k=0}^{t} \delta(i_k = i)} \\
&= \sum_{j_1=1}^{n} \cdots \sum_{j_m=1}^{n} \lim_{t\to\infty} \frac{\sum_{k=0}^{t} \delta(i_k = i, i_{k+1} = j_1, \ldots, i_{k+m} = j_m)}{\sum_{k=0}^{t} \delta(i_k = i)}\, w_{k,m}\, b_{i_{k+m}} \\
&= \sum_{j_1=1}^{n} \cdots \sum_{j_m=1}^{n} a_{i j_1} a_{j_1 j_2} \cdots a_{j_{m-1} j_m} b_{j_m} = (A^m b)(i),
\end{aligned}
\]
where the third equality follows using Eqs. (6.262) and (6.264).
By using the approximation formula (6.263), it is possible to construct complex simulation-based approximations to formulas that involve powers of $A$. As an example that we have not encountered so far in the DP context, we may obtain by simulation the solution $x^*$ of the linear system $x = b + Ax$, which can be expressed as
\[ x^* = (I - A)^{-1} b = \sum_{\ell=0}^{\infty} A^\ell b, \]
assuming the eigenvalues of $A$ are all strictly within the unit circle. Historically, this is the first method for simulation-based matrix inversion and solution of linear systems, due to von Neumann and Ulam (unpublished but described by Forsythe and Leibler [FoL50]).
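As a concrete illustration of Eq. (6.263), the sketch below (Python/NumPy; the small matrices and the sampling chain P are hypothetical choices, not prescribed by the text) runs a single long trajectory of P, forms the importance-weighted averages $w_{k,m} b_{i_{k+m}}$ to estimate $(A^m b)(i)$, and then sums a few such terms to approximate $x^* = \sum_\ell A^\ell b$ in the spirit of the von Neumann-Ulam method. The trajectory length is modest, so the output is only roughly accurate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: A with spectral radius < 1, and a sampling chain P
# whose transition probabilities are positive wherever A is nonzero.
A = np.array([[0.3, 0.2, 0.0],
              [0.1, 0.2, 0.3],
              [0.2, 0.0, 0.4]])
b = np.array([1.0, -1.0, 2.0])
P = np.full((3, 3), 1.0 / 3.0)

T, n = 100_000, 3
traj = np.empty(T, dtype=int)
traj[0] = 0
for k in range(1, T):                            # one long trajectory i_0, i_1, ...
    traj[k] = rng.choice(n, p=P[traj[k - 1]])

def estimate_Am_b(i, m, max_powers):
    """Estimate (A^m b)(i) via Eq. (6.263), using the weights w_{k,m} of Eq. (6.262)."""
    num, den = 0.0, 0
    for k in np.flatnonzero(traj[:T - max_powers] == i):
        w = 1.0
        for s in range(m):                       # product of a/p ratios along the path
            w *= A[traj[k + s], traj[k + s + 1]] / P[traj[k + s], traj[k + s + 1]]
        num += w * b[traj[k + m]]
        den += 1
    return num / den

M = 12                                           # truncation of the Neumann series
x_est = np.array([sum(estimate_Am_b(i, m, M) for m in range(M + 1)) for i in range(n)])
x_true = np.linalg.solve(np.eye(n) - A, b)
print(x_est, x_true)
```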
λ-Methods

We will now summarize extensions of LSTD($\lambda$), LSPE($\lambda$), and TD($\lambda$) to solve the general fixed point problem $x = b + Ax$. The underlying idea is the approximation of Eq. (6.263). We refer to [BeY09], [Ber11a], [Yu10a], and [Yu10b] for detailed derivations and analysis. Similar to Section 6.3.6, these methods aim to solve the $\lambda$-projected equation
\[ \Phi r = \Pi T^{(\lambda)}(\Phi r), \]
or equivalently
\[ C^{(\lambda)} r = d^{(\lambda)}, \]
where
\[ C^{(\lambda)} = \Phi' \Xi \bigl(I - A^{(\lambda)}\bigr)\Phi, \qquad d^{(\lambda)} = \Phi' \Xi\, b^{(\lambda)}, \]
with
\[ A^{(\lambda)} = (1-\lambda) \sum_{\ell=0}^{\infty} \lambda^\ell A^{\ell+1}, \qquad b^{(\lambda)} = \sum_{\ell=0}^{\infty} \lambda^\ell A^\ell b, \]
by using simulation-based approximations, and either matrix inversion or iteration.
As in Section 6.3.6, the simulation is used to construct approximations $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$ of $C^{(\lambda)}$ and $d^{(\lambda)}$, respectively. Given the simulated sequence $i_0, i_1, \ldots$ obtained by row/column sampling using transition probabilities $p_{ij}$, $C_k^{(\lambda)}$ and $d_k^{(\lambda)}$ are generated by
\[ C_k^{(\lambda)} = (1 - \delta_k)\, C_{k-1}^{(\lambda)} + \delta_k\, z_k \Bigl( \phi(i_k) - \frac{a_{i_k i_{k+1}}}{p_{i_k i_{k+1}}}\, \phi(i_{k+1}) \Bigr)', \]
\[ d_k^{(\lambda)} = (1 - \delta_k)\, d_{k-1}^{(\lambda)} + \delta_k\, z_k\, g(i_k, i_{k+1}), \]
where $z_k$ are modified eligibility vectors given by
\[ z_k = \lambda\, \frac{a_{i_{k-1} i_k}}{p_{i_{k-1} i_k}}\, z_{k-1} + \phi(i_k), \tag{6.265} \]
the initial conditions are $z_{-1} = 0$, $C_{-1}^{(\lambda)} = 0$, $d_{-1}^{(\lambda)} = 0$, and
\[ \delta_k = \frac{1}{k+1}, \qquad k = 0, 1, \ldots. \]
The matrix inversion/LSTD($\lambda$) analog is to solve the equation $C_k^{(\lambda)} r = d_k^{(\lambda)}$, while the iterative/LSPE($\lambda$) analog is
\[ r_{k+1} = r_k - \gamma\, G_k \bigl( C_k^{(\lambda)} r_k - d_k^{(\lambda)} \bigr), \]
where $G_k$ is a positive definite scaling matrix and $\gamma$ is a positive stepsize. There is also a generalized version of the TD($\lambda$) method. It has the form
\[ r_{k+1} = r_k + \gamma_k\, z_k\, q_k(i_k), \]
where $\gamma_k$ is a diminishing positive scalar stepsize, $z_k$ is given by Eq. (6.265), and $q_k(i_k)$ is the temporal difference analog given by
\[ q_k(i_k) = b_{i_k} + \frac{a_{i_k i_{k+1}}}{p_{i_k i_{k+1}}}\, \phi(i_{k+1})' r_k - \phi(i_k)' r_k. \]
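A minimal sketch of these recursions (Python/NumPy; the problem data, features, and sampling chain are hypothetical) is given below. It accumulates $C_k^{(\lambda)}$, $d_k^{(\lambda)}$, and the eligibility vectors $z_k$ of Eq. (6.265) along a single trajectory, with $b_{i_k}$ used as the per-sample cost term in place of $g(i_k, i_{k+1})$, and then solves $C_k^{(\lambda)} r = d_k^{(\lambda)}$ as in the LSTD($\lambda$) analog.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical system x = b + Ax, features Phi, and sampling chain P
A = np.array([[0.3, 0.2, 0.0],
              [0.1, 0.2, 0.3],
              [0.2, 0.0, 0.4]])
b = np.array([1.0, -1.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
P = np.full((3, 3), 1.0 / 3.0)
lam, n = 0.7, 3

C = np.zeros((2, 2))
d = np.zeros(2)
z = np.zeros(2)
prev_ratio = 1.0                                   # a/p ratio of the previous transition
i = 0
for k in range(200_000):
    j = rng.choice(n, p=P[i])                      # sample transition (i_k, i_{k+1})
    ratio = A[i, j] / P[i, j]
    z = lam * prev_ratio * z + Phi[i]              # eligibility vector, Eq. (6.265)
    delta = 1.0 / (k + 1)
    C = (1 - delta) * C + delta * np.outer(z, Phi[i] - ratio * Phi[j])
    d = (1 - delta) * d + delta * z * b[i]
    prev_ratio, i = ratio, j

r_lstd = np.linalg.solve(C, d)                     # LSTD(lambda) analog: solve C_k r = d_k
x_true = np.linalg.solve(np.eye(n) - A, b)
print("approximation Phi r:", Phi @ r_lstd)
print("exact solution x*  :", x_true)
```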
6.8.5 Extension of Q-Learning for Optimal Stopping

If the mapping $T$ is nonlinear (as for example in the case of multiple policies) the projected equation $\Phi r = \Pi T(\Phi r)$ is also nonlinear, and may have one or multiple solutions, or no solution at all. On the other hand, if $\Pi T$ is a contraction, there is a unique solution. We have seen in Section 6.5.3 a nonlinear special case of projected equation where $\Pi T$ is a contraction, namely optimal stopping. This case can be generalized as we now show.

Let us consider a system of the form
\[ x = T(x) = A f(x) + b, \tag{6.266} \]
where $f : \Re^n \to \Re^n$ is a mapping with scalar function components of the form $f(x) = \bigl( f_1(x_1), \ldots, f_n(x_n) \bigr)$. We assume that each of the mappings $f_i : \Re \to \Re$ is nonexpansive in the sense that
\[ \bigl| f_i(x_i) - f_i(\bar x_i) \bigr| \le |x_i - \bar x_i|, \qquad i = 1, \ldots, n, \quad x_i, \bar x_i \in \Re. \tag{6.267} \]
This guarantees that $T$ is a contraction mapping with respect to any norm $\|\cdot\|$ with the property
\[ \|y\| \le \|z\| \qquad \text{if } |y_i| \le |z_i|,\ i = 1, \ldots, n, \]
whenever $A$ is a contraction with respect to that norm. Such norms include weighted $l_1$ and $l_\infty$ norms, the norm $\|\cdot\|_\xi$, as well as any scaled Euclidean norm $\|x\| = \sqrt{x' D x}$, where $D$ is a positive definite symmetric matrix with nonnegative components. Under the assumption (6.267), the theory of Section 6.8.2 applies and suggests appropriate choices of a Markov chain for simulation so that $\Pi T$ is a contraction.
As an example, consider the equation
\[ x = T(x) = \alpha P f(x) + b, \]
where $P$ is an irreducible transition probability matrix with steady-state probability vector $\xi$, $\alpha \in (0,1)$ is a scalar discount factor, and $f$ is a mapping with components
\[ f_i(x_i) = \min\{c_i, x_i\}, \qquad i = 1, \ldots, n, \tag{6.268} \]
where $c_i$ are some scalars. This is the Q-factor equation corresponding to a discounted optimal stopping problem with states $i = 1, \ldots, n$, and a choice between two actions at each state $i$: stop at a cost $c_i$, or continue at a cost $b_i$ and move to state $j$ with probability $p_{ij}$. The optimal cost starting from state $i$ is $\min\{c_i, x^*_i\}$, where $x^*$ is the fixed point of $T$. As a special case of Prop. 6.8.2, we obtain that $\Pi T$ is a contraction with respect to $\|\cdot\|_\xi$. Similar results hold in the case where $\alpha P$ is replaced by a matrix $A$ satisfying condition (2) of Prop. 6.8.2, or the conditions of Prop. 6.8.3.
A version of the LSPE-type algorithm for solving the system (6.266), which extends the method of Section 6.5.3 for optimal stopping, may be used when $\Pi T$ is a contraction. In particular, the iteration
\[ \Phi r_{k+1} = \Pi T(\Phi r_k), \qquad k = 0, 1, \ldots, \]
takes the form
\[ r_{k+1} = \Bigl( \sum_{i=1}^{n} \xi_i\, \phi(i) \phi(i)' \Bigr)^{-1} \sum_{i=1}^{n} \xi_i\, \phi(i) \Bigl( \sum_{j=1}^{n} a_{ij}\, f_j\bigl( \phi(j)' r_k \bigr) + b_i \Bigr), \]
and is approximated by
\[ r_{k+1} = \Bigl( \sum_{t=0}^{k} \phi(i_t) \phi(i_t)' \Bigr)^{-1} \sum_{t=0}^{k} \phi(i_t) \Bigl( \frac{a_{i_t j_t}}{p_{i_t j_t}}\, f_{j_t}\bigl( \phi(j_t)' r_k \bigr) + b_{i_t} \Bigr). \tag{6.269} \]
Here, as before, $\{i_0, i_1, \ldots\}$ is a state sequence, and $\{(i_0, j_0), (i_1, j_1), \ldots\}$ is a transition sequence satisfying Eqs. (6.245) and (6.247) with probability 1. The justification of this approximation is very similar to the ones given so far, and will not be discussed further. Diagonally scaled versions of this iteration are also possible.

A difficulty with iteration (6.269) is that the terms $f_{j_t}\bigl(\phi(j_t)' r_k\bigr)$ must be computed for all $t = 0, \ldots, k$, at every step $k$, thereby resulting in significant overhead. The methods to bypass this difficulty in the case of optimal stopping, discussed at the end of Section 6.5.3, can be extended to the more general context considered here.
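The following sketch (Python/NumPy; the data A, b, the stopping costs c, the features, and the sampling chain are all hypothetical) implements iteration (6.269) for the optimal-stopping-type nonlinearity $f_j(y) = \min\{c_j, y\}$, recomputing the nonlinear terms over the whole sample history at each step, exactly the overhead the text warns about.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discounted stopping problem: T(x) = alpha * P * f(x) + b
n, s, alpha = 5, 2, 0.9
P = rng.dirichlet(np.ones(n), size=n)            # model/sampling chain
A = alpha * P                                    # A = alpha * P, as in the example above
b = rng.normal(size=n)
c = rng.normal(size=n)                           # stopping costs
Phi = np.column_stack([np.ones(n), np.arange(n, dtype=float)])

def f(j, y):                                     # componentwise nonexpansive map, Eq. (6.268)
    return min(c[j], y)

# Simulate states i_t and transitions (i_t, j_t) using the same chain P
K = 1000
states = np.empty(K, dtype=int); states[0] = 0
for t in range(1, K):
    states[t] = rng.choice(n, p=P[states[t - 1]])
succ = np.array([rng.choice(n, p=P[i]) for i in states])

r = np.zeros(s)
for k in range(1, K):
    i_t, j_t = states[:k + 1], succ[:k + 1]
    M = Phi[i_t].T @ Phi[i_t]                    # sum of phi(i_t) phi(i_t)'
    vals = np.array([A[i, j] / P[i, j] * f(j, Phi[j] @ r) + b[i]
                     for i, j in zip(i_t, j_t)]) # recomputed over all past samples
    # small regularization keeps the early iterates well defined
    r = np.linalg.solve(M + 1e-6 * np.eye(s), Phi[i_t].T @ vals)   # cf. Eq. (6.269)

print("approximate Q-factors Phi r:", Phi @ r)
```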
Let us finally consider the case where instead of $A = \alpha P$, the matrix $A$ satisfies condition (2) of Prop. 6.8.2, or the conditions of Prop. 6.8.3. The case where $\sum_{j=1}^{n} |a_{ij}| < 1$ for some index $i$, and $0 \le A \le Q$, where $Q$ is an irreducible transition probability matrix, corresponds to an undiscounted optimal stopping problem where the stopping state will be reached from all other states with probability 1, even without applying the stopping action. In this case, from Prop. 6.8.2 under condition (3), it follows that $A$ is a contraction with respect to some norm, and hence $I - A$ is invertible. Using this fact, it can be shown by modifying the proof of Prop. 6.8.3 that the mapping $\Pi T_\gamma$, where
\[ T_\gamma(x) = (1-\gamma) x + \gamma T(x), \]
is a contraction with respect to $\|\cdot\|_\xi$ for all $\gamma \in (0,1)$. Thus, $\Pi T_\gamma$ has a unique fixed point, which must also be the unique fixed point of $\Pi T$ (since $\Pi T$ and $\Pi T_\gamma$ have the same fixed points).

In view of the contraction property of $\Pi T_\gamma$, the damped PVI iteration
\[ \Phi r_{k+1} = (1-\gamma) \Phi r_k + \gamma \Pi T(\Phi r_k) \]
converges to the unique fixed point of $\Pi T$ and takes the form
\[ r_{k+1} = (1-\gamma) r_k + \gamma \Bigl( \sum_{i=1}^{n} \xi_i\, \phi(i)\phi(i)' \Bigr)^{-1} \sum_{i=1}^{n} \xi_i\, \phi(i) \Bigl( \sum_{j=1}^{n} a_{ij}\, f_j\bigl( \phi(j)' r_k \bigr) + b_i \Bigr). \]
As earlier, it can be approximated by the LSPE iteration
\[ r_{k+1} = (1-\gamma) r_k + \gamma \Bigl( \sum_{t=0}^{k} \phi(i_t)\phi(i_t)' \Bigr)^{-1} \sum_{t=0}^{k} \phi(i_t) \Bigl( \frac{a_{i_t j_t}}{p_{i_t j_t}}\, f_{j_t}\bigl( \phi(j_t)' r_k \bigr) + b_{i_t} \Bigr) \]
[cf. Eq. (6.269)].
6.8.6 Bellman Equation Error-Type Methods

We will now consider an alternative approach for approximate solution of the linear equation $x = T(x) = b + Ax$, based on finding a vector $r$ that minimizes
\[ \|\Phi r - T(\Phi r)\|_\xi^2, \]
or
\[ \sum_{i=1}^{n} \xi_i \Bigl( \phi(i)' r - \sum_{j=1}^{n} a_{ij}\, \phi(j)' r - b_i \Bigr)^2, \]
where $\xi$ is a distribution with positive components. In the DP context where the equation $x = T(x)$ is the Bellman equation for a fixed policy, this is known as the Bellman equation error approach (see [BeT96], Section 6.10, for a detailed discussion of this case, and the more complicated nonlinear case where $T$ involves minimization over multiple policies). We assume that the matrix $(I - A)\Phi$ has rank $s$, which guarantees that the vector $r^*$ that minimizes the weighted sum of squared errors is unique.
We note that the equation error approach is related to the projected equation approach. To see this, consider the case where $\xi$ is the uniform distribution, so the problem is to minimize
\[ \bigl\| \Phi r - (b + A\Phi r) \bigr\|^2, \tag{6.270} \]
where $\|\cdot\|$ is the standard Euclidean norm. By setting the gradient to 0, we see that a necessary and sufficient condition for optimality is
\[ \Phi'(I - A)' \bigl( \Phi r^* - T(\Phi r^*) \bigr) = 0, \]
or equivalently,
\[ \Phi' \bigl( \Phi r^* - \hat T(\Phi r^*) \bigr) = 0, \]
where
\[ \hat T(x) = T(x) + A' \bigl( x - T(x) \bigr). \]
Thus minimization of the equation error (6.270) is equivalent to solving the projected equation
\[ \Phi r = \Pi \hat T(\Phi r), \]
where $\Pi$ denotes projection with respect to the standard Euclidean norm. A similar conversion is possible when $\xi$ is a general distribution with positive components.
Error bounds analogous to the projected equation bounds of Eqs. (6.241) and (6.242) can be developed for the equation error approach, assuming that $I - A$ is invertible and $x^*$ is the unique solution. In particular, let $\tilde r$ minimize $\|\Phi r - T(\Phi r)\|_\xi^2$. Then
\[ x^* - \Phi \tilde r = T x^* - T(\Phi \tilde r) + T(\Phi \tilde r) - \Phi \tilde r = A(x^* - \Phi \tilde r) + T(\Phi \tilde r) - \Phi \tilde r, \]
so that
\[ x^* - \Phi \tilde r = (I - A)^{-1} \bigl( T(\Phi \tilde r) - \Phi \tilde r \bigr). \]
Thus, we obtain
\[
\begin{aligned}
\|x^* - \Phi \tilde r\|_\xi &\le \bigl\| (I - A)^{-1} \bigr\|_\xi\, \|\Phi \tilde r - T(\Phi \tilde r)\|_\xi \\
&\le \bigl\| (I - A)^{-1} \bigr\|_\xi\, \bigl\| \Pi x^* - T(\Pi x^*) \bigr\|_\xi \\
&= \bigl\| (I - A)^{-1} \bigr\|_\xi\, \bigl\| \Pi x^* - x^* + T x^* - T(\Pi x^*) \bigr\|_\xi \\
&= \bigl\| (I - A)^{-1} \bigr\|_\xi\, \bigl\| (I - A)(\Pi x^* - x^*) \bigr\|_\xi \\
&\le \bigl\| (I - A)^{-1} \bigr\|_\xi\, \|I - A\|_\xi\, \|x^* - \Pi x^*\|_\xi,
\end{aligned}
\]
where the second inequality holds because $\tilde r$ minimizes $\|\Phi r - T(\Phi r)\|_\xi^2$. In the case where $T$ is a contraction mapping with respect to the norm $\|\cdot\|_\xi$, with modulus $\alpha \in (0,1)$, a similar calculation yields
\[ \|x^* - \Phi \tilde r\|_\xi \le \frac{1+\alpha}{1-\alpha}\, \|x^* - \Pi x^*\|_\xi. \]
The vector $r^*$ that minimizes $\|\Phi r - T(\Phi r)\|_\xi^2$ satisfies the corresponding necessary optimality condition
\[ \sum_{i=1}^{n} \xi_i \Bigl( \phi(i) - \sum_{j=1}^{n} a_{ij}\, \phi(j) \Bigr) \Bigl( \phi(i) - \sum_{j=1}^{n} a_{ij}\, \phi(j) \Bigr)' r^* = \sum_{i=1}^{n} \xi_i \Bigl( \phi(i) - \sum_{j=1}^{n} a_{ij}\, \phi(j) \Bigr) b_i. \tag{6.271} \]
To obtain a simulation-based approximation to Eq. (6.271), without requiring the calculation of row sums of the form $\sum_{j=1}^{n} a_{ij} \phi(j)$, we introduce an additional sequence of transitions $\{(i_0, j_0'), (i_1, j_1'), \ldots\}$ (see Fig. 6.8.2), which is generated according to the transition probabilities $p_{ij}$ of the Markov chain, and is "independent" of the sequence $\{(i_0, j_0), (i_1, j_1), \ldots\}$ in the sense that with probability 1,
\[ \lim_{t\to\infty} \frac{\sum_{k=0}^{t} \delta(i_k = i, j_k = j)}{\sum_{k=0}^{t} \delta(i_k = i)} = \lim_{t\to\infty} \frac{\sum_{k=0}^{t} \delta(i_k = i, j_k' = j)}{\sum_{k=0}^{t} \delta(i_k = i)} = p_{ij}, \tag{6.272} \]
for all $i, j = 1, \ldots, n$, and
\[ \lim_{t\to\infty} \frac{\sum_{k=0}^{t} \delta(i_k = i, j_k = j, j_k' = j')}{\sum_{k=0}^{t} \delta(i_k = i)} = p_{ij}\, p_{ij'}, \tag{6.273} \]
for all $i, j, j' = 1, \ldots, n$. At time $t$, we form the linear equation
\[ \sum_{k=0}^{t} \Bigl( \phi(i_k) - \frac{a_{i_k j_k}}{p_{i_k j_k}}\, \phi(j_k) \Bigr) \Bigl( \phi(i_k) - \frac{a_{i_k j_k'}}{p_{i_k j_k'}}\, \phi(j_k') \Bigr)' r = \sum_{k=0}^{t} \Bigl( \phi(i_k) - \frac{a_{i_k j_k}}{p_{i_k j_k}}\, \phi(j_k) \Bigr) b_{i_k}. \tag{6.274} \]
Similar to our earlier analysis, it can be seen that this is a valid approximation to Eq. (6.271).
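Here is a minimal sketch (Python/NumPy; all problem data and the sampling chain are hypothetical) of the two-transition-sequence scheme: at each sampled state it draws two independent successors $j_k$ and $j_k'$, accumulates the two sides of Eq. (6.274), and solves for $r$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical linear system x = b + Ax and features
A = np.array([[0.3, 0.2, 0.0],
              [0.1, 0.2, 0.3],
              [0.2, 0.0, 0.4]])
b = np.array([1.0, -1.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
P = np.full((3, 3), 1.0 / 3.0)                    # sampling chain
n, s = 3, 2

M = np.zeros((s, s))
v = np.zeros(s)
i = 0
for _ in range(100_000):
    j = rng.choice(n, p=P[i])                     # first successor j_k
    jp = rng.choice(n, p=P[i])                    # second, independent successor j'_k
    u = Phi[i] - (A[i, j] / P[i, j]) * Phi[j]     # sample of phi(i) - sum_j a_ij phi(j)
    up = Phi[i] - (A[i, jp] / P[i, jp]) * Phi[jp] # independent sample of the same vector
    M += np.outer(u, up)                          # left-hand side of Eq. (6.274)
    v += u * b[i]                                 # right-hand side of Eq. (6.274)
    i = rng.choice(n, p=P[i])                     # advance the state sequence via the chain

r_hat = np.linalg.solve(M, v)
x_true = np.linalg.solve(np.eye(n) - A, b)
print("Bellman-error fit Phi r:", Phi @ r_hat)
print("exact solution x*      :", x_true)
```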
Note a disadvantage of this approach relative to the projected equation approach (cf. Section 6.8.1). It is necessary to generate two sequences of transitions (rather than one). Moreover, both of these sequences enter Eq. (6.274), which thus contains more simulation noise than its projected equation counterpart [cf. Eq. (6.248)].

[Figure 6.8.2: A possible simulation mechanism for minimizing the equation error norm [cf. Eq. (6.274)]. We generate a sequence of states $\{i_0, i_1, \ldots\}$ according to the distribution $\xi$, by simulating a single infinitely long sample trajectory of the chain. Simultaneously, we generate two independent sequences of transitions, $\{(i_0, j_0), (i_1, j_1), \ldots\}$ and $\{(i_0, j_0'), (i_1, j_1'), \ldots\}$, according to the transition probabilities $p_{ij}$, so that Eqs. (6.272) and (6.273) are satisfied.]
Let us finally note that the equation error approach can be generalized to yield a simulation-based method for solving the general linear least squares problem
\[ \min_{r} \sum_{i=1}^{n} \xi_i \Bigl( c_i - \sum_{j=1}^{m} q_{ij}\, \phi(j)' r \Bigr)^2, \]
where $q_{ij}$ are the components of an $n \times m$ matrix $Q$, and $c_i$ are the components of a vector $c \in \Re^n$. In particular, one may write the corresponding optimality condition [cf. Eq. (6.271)] and then approximate it by simulation [cf. Eq. (6.274)]; see [BeY09], and [WPB09], [PWB09], which also discuss a regression-based approach to deal with nearly singular problems (cf. the regression-based LSTD method of Section 6.3.4). Conversely, one may consider a selected set $I$ of states of moderate size, and find $r^*$ that minimizes the sum of squared Bellman equation errors only for these states:
\[ r^* \in \arg\min_{r \in \Re^s} \sum_{i \in I} \xi_i \Bigl( \phi(i)' r - \sum_{j=1}^{n} a_{ij}\, \phi(j)' r - b_i \Bigr)^2. \]
This least squares problem may be solved by conventional (non-simulation) methods.
An interesting question is how the approach of this section compares with the projected equation approach in terms of approximation error. No definitive answer seems possible, and examples where one approach gives better results than the other have been constructed. Reference [Ber95] shows that in the example of Exercise 6.9, the projected equation approach gives worse results. For an example where the projected equation approach may be preferable, see Exercise 6.11.
Approximate Policy Iteration with Bellman Equation Error Evaluation

When the Bellman equation error approach is used in conjunction with approximate policy iteration in a DP context, it is susceptible to chattering and oscillation just as much as the projected equation approach (cf. Section 6.3.8). The reason is that both approaches operate within the same greedy partition, and oscillate when there is a cycle of policies $\mu^k, \mu^{k+1}, \ldots, \mu^{k+m}$ with
\[ r_{\mu^k} \in R_{\mu^{k+1}}, \quad r_{\mu^{k+1}} \in R_{\mu^{k+2}}, \quad \ldots, \quad r_{\mu^{k+m-1}} \in R_{\mu^{k+m}}, \quad r_{\mu^{k+m}} \in R_{\mu^k} \]
(cf. Fig. 6.3.4). The only difference is that the weight vector $r_\mu$ of a policy $\mu$ is calculated differently (by solving a least-squares Bellman error problem versus solving a projected equation). In practice the weights calculated by the two approaches may differ somewhat, but generally not enough to cause dramatic changes in qualitative behavior. Thus, much of our discussion of optimistic policy iteration in Sections 6.3.5-6.3.6 applies to the Bellman equation error approach as well.

Example 6.3.2 (continued)

Let us return to Example 6.3.2 where chattering occurs when $r_\mu$ is evaluated using the projected equation. When the Bellman equation error approach is used instead, the greedy partition remains the same (cf. Fig. 6.3.6), the weight of policy $\mu$ is $r_\mu = 0$ (as in the projected equation case), and for $p \approx 1$, the weight of policy $\mu^*$ can be calculated to be
\[ r_{\mu^*} \approx \frac{c}{(1-\alpha)\bigl((1-\alpha)^2 + (2-\alpha)^2\bigr)} \]
[which is almost the same as the weight $c/(1-\alpha)$ obtained in the projected equation case]. Thus with both approaches we have oscillation between $\mu$ and $\mu^*$ in approximate policy iteration, and chattering in optimistic versions, with very similar iterates.
6.8.7 Oblique Projections

Some of the preceding methodology regarding projected equations can be generalized to the case where the projection operator is oblique (i.e., it is not a projection with respect to the weighted Euclidean norm; see e.g., Saad [Saa03]). Such projections have the form
\[ \Pi = \Phi (\Psi' \Xi \Phi)^{-1} \Psi' \Xi, \tag{6.275} \]
where as before, $\Xi$ is the diagonal matrix with the components $\xi_1, \ldots, \xi_n$ of a positive distribution vector $\xi$ along the diagonal, $\Phi$ is an $n \times s$ matrix of rank $s$, and $\Psi$ is an $n \times s$ matrix of rank $s$. The earlier case corresponds to $\Psi = \Phi$. Two characteristic properties of $\Pi$ as given by Eq. (6.275) are that its range is the subspace $S = \{\Phi r \mid r \in \Re^s\}$ and that it is idempotent, i.e., $\Pi^2 = \Pi$. Conversely, a matrix $\Pi$ with these two properties can be shown to have the form (6.275) for some $n \times s$ matrix $\Psi$ of rank $s$ and a diagonal matrix $\Xi$ with the components $\xi_1, \ldots, \xi_n$ of a positive distribution vector along the diagonal. Oblique projections arise in a variety of interesting contexts, for which we refer to the literature.

Let us now consider the generalized projected equation
\[ \Phi r = \Pi T(\Phi r) = \Pi (b + A \Phi r). \tag{6.276} \]
Using Eq. (6.275) and the fact that $\Phi$ has rank $s$, it can be written as
\[ r = (\Psi' \Xi \Phi)^{-1} \Psi' \Xi (b + A \Phi r), \]
or equivalently $\Psi' \Xi \Phi r = \Psi' \Xi (b + A \Phi r)$, which can be finally written as
\[ C r = d, \]
where
\[ C = \Psi' \Xi (I - A) \Phi, \qquad d = \Psi' \Xi\, b. \tag{6.277} \]
These equations should be compared to the corresponding equations for the Euclidean projection case where $\Psi = \Phi$ [cf. Eq. (6.243)].
It is clear that row and column sampling can be adapted to provide simulation-based estimates $C_k$ and $d_k$ of $C$ and $d$, respectively. The corresponding equations have the form [cf. Eq. (6.248)]
\[ C_k = \frac{1}{k+1} \sum_{t=0}^{k} \psi(i_t) \Bigl( \phi(i_t) - \frac{a_{i_t j_t}}{p_{i_t j_t}}\, \phi(j_t) \Bigr)', \qquad d_k = \frac{1}{k+1} \sum_{t=0}^{k} \psi(i_t)\, b_{i_t}, \tag{6.278} \]
where $\psi'(i)$ is the $i$th row of $\Psi$. The sequence of vectors $C_k^{-1} d_k$ converges with probability one to the solution $C^{-1} d$ of the projected equation, assuming that $C$ is nonsingular. For cases where $C_k$ is nearly singular, the regression/regularization-based estimate (6.250) may be used. The corresponding iterative method is
\[ r_{k+1} = (C_k' \Sigma^{-1} C_k + \beta I)^{-1} (C_k' \Sigma^{-1} d_k + \beta r_k), \]
and can be shown to converge with probability one to $C^{-1} d$.
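The sketch below (Python/NumPy, illustrative data only) forms the oblique projection of Eq. (6.275) for given $\Phi$, $\Psi$, $\Xi$, verifies the two characteristic properties (range $S$ and idempotence), and solves $Cr = d$ with $C, d$ as in Eq. (6.277).

```python
import numpy as np

rng = np.random.default_rng(4)

n, s = 5, 2
A = 0.4 * rng.dirichlet(np.ones(n), size=n)       # small row sums, so I - A is invertible
b = rng.normal(size=n)
Phi = rng.normal(size=(n, s))
Psi = rng.normal(size=(n, s))                      # Psi != Phi gives an oblique projection
xi = np.full(n, 1.0 / n)
Xi = np.diag(xi)

Pi = Phi @ np.linalg.inv(Psi.T @ Xi @ Phi) @ Psi.T @ Xi   # Eq. (6.275)
assert np.allclose(Pi @ Pi, Pi)                    # idempotent
assert np.allclose(Pi @ Phi, Phi)                  # fixes the subspace S = {Phi r}

C = Psi.T @ Xi @ (np.eye(n) - A) @ Phi             # Eq. (6.277)
d = Psi.T @ Xi @ b
r = np.linalg.solve(C, d)

# Phi r solves the generalized projected equation Phi r = Pi (b + A Phi r), Eq. (6.276)
print(np.allclose(Phi @ r, Pi @ (b + A @ (Phi @ r))))
```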
An example where oblique projections arise in DP is aggregation/discretization with a coarse grid [cases (c) and (d) in Section 6.4, with the aggregate states corresponding to some distinct representative states $x_1, \ldots, x_s$ of the original problem; also Example 6.4.1]. Then the aggregation equation for a discounted problem has the form
\[ \Phi r = \Phi D (b + \alpha P \Phi r), \tag{6.279} \]
where the rows of $D$ are unit vectors (have a single component equal to 1, corresponding to a representative state, and all other components equal to 0), and the rows of $\Phi$ are probability distributions, with the rows corresponding to the representative states $x_k$ having a single unit component, $\Phi_{x_k x_k} = 1$, $k = 1, \ldots, s$. Then the matrix $D\Phi$ can be seen to be the identity, so we have $\Phi D \Phi D = \Phi D$ and it follows that $\Phi D$ is an oblique projection. The conclusion is that the aggregation equation (6.279) in the special case of coarse grid discretization is the projected equation (6.276), with the oblique projection $\Pi = \Phi D$.
6.8.8 Generalized Aggregation by Simulation

We will finally discuss the simulation-based iterative solution of a general system of equations of the form
\[ r = D T(\Phi r), \tag{6.280} \]
where $T : \Re^n \to \Re^m$ is a (possibly nonlinear) mapping, $D$ is an $s \times m$ matrix, and $\Phi$ is an $n \times s$ matrix. In the case $m = n$, we can regard the system (6.280) as an approximation to a system of the form
\[ x = T(x). \tag{6.281} \]
In particular, the variables $x_i$ of the system (6.281) are approximated by linear combinations of the variables $r_j$ of the system (6.280), using the rows of $\Phi$. Furthermore, the components of the mapping $DT$ are obtained by linear combinations of the components of $T$, using the rows of $D$. Thus we may view the system (6.280) as being obtained by aggregation/linear combination of the variables and the equations of the system (6.281).
We have encountered equations of the form (6.280) in our discussion of aggregation (Section 6.4) and Q-learning (Section 6.5). For example, the aggregation mapping
\[ (FR)(x) = \sum_{i=1}^{n} d_{xi} \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \Bigl( g(i, u, j) + \alpha \sum_{y \in S} \phi_{jy} R(y) \Bigr), \qquad x \in S, \tag{6.282} \]
[cf. Eq. (6.161)] is of the form (6.280), where $r = R$, the dimension $s$ is equal to the number of aggregate states $x$, $m = n$ is the number of states $i$, and the matrices $D$ and $\Phi$ consist of the disaggregation and aggregation probabilities, respectively.

As another example, the Q-learning mapping
\[ (FQ)(i, u) = \sum_{j=1}^{n} p_{ij}(u) \Bigl( g(i, u, j) + \alpha \min_{v \in U(j)} Q(j, v) \Bigr), \qquad \forall\ (i, u), \tag{6.283} \]
[cf. Eq. (6.179)] is of the form (6.280), where $r = Q$, the dimensions $s$ and $n$ are equal to the number of state-control pairs $(i, u)$, the dimension $m$ is the number of state-control-next state triples $(i, u, j)$, the components of $D$ are the appropriate probabilities $p_{ij}(u)$, $\Phi$ is the identity, and $T$ is the nonlinear mapping that transforms $Q$ to the vector with a component $g(i, u, j) + \alpha \min_{v \in U(j)} Q(j, v)$ for each $(i, u, j)$.

As a third example, consider the following Bellman's equation over the space of post-decision states $m$ [cf. Eq. (6.11)]:
\[ V(m) = \sum_{j=1}^{n} q(m, j) \min_{u \in U(j)} \bigl[ g(j, u) + \alpha V\bigl( f(j, u) \bigr) \bigr], \qquad \forall\ m. \tag{6.284} \]
This equation is of the form (6.280), where $r = V$, the dimension $s$ is equal to the number of post-decision states, $m = n$ is the number of (pre-decision) states $i$, the matrix $D$ consists of the probabilities $q(m, j)$, and $\Phi$ is the identity matrix.

There are also versions of the preceding examples, which involve evaluation of a single policy, in which case there is no minimization in Eqs. (6.283)-(6.284), and the corresponding mapping $T$ is linear. We will now consider separately cases where $T$ is linear and where $T$ is nonlinear. For the linear case, we will give an LSTD-type method, while for the nonlinear case (where the LSTD approach does not apply), we will discuss iterative methods under some contraction assumptions on $T$, $D$, and $\Phi$.
The Linear Case

Let $T$ be linear, so the equation $r = DT(\Phi r)$ has the form
\[ r = D(b + A\Phi r), \tag{6.285} \]
where $A$ is an $m \times n$ matrix, and $b \in \Re^m$. We can thus write this equation as
\[ E r = f, \]
where
\[ E = I - D A \Phi, \qquad f = D b. \]
To interpret the system (6.285), note that the matrix $A\Phi$ is obtained by replacing the $n$ columns of $A$ by $s$ weighted sums of columns of $A$, with the weights defined by the corresponding columns of $\Phi$. The matrix $DA\Phi$ is obtained by replacing the $m$ rows of $A\Phi$ by $s$ weighted sums of rows of $A\Phi$, with the weights defined by the corresponding rows of $D$. The simplest case is to form $DA\Phi$ by discarding $n - s$ columns and $m - s$ rows of $A$.

As in the case of projected equations (cf. Section 6.8.1), we can use low-dimensional simulation to approximate $E$ and $f$ based on row and column sampling. One way to do this is to introduce for each index $i = 1, \ldots, m$, a distribution $\{p_{ij} \mid j = 1, \ldots, n\}$ with the property
\[ p_{ij} > 0 \quad \text{if } a_{ij} \ne 0, \]
and to obtain a sample sequence $\{(i_0, j_0), (i_1, j_1), \ldots\}$. We do so by first generating a sequence of row indices $\{i_0, i_1, \ldots\}$ through sampling according to some distribution $\{\xi_i \mid i = 1, \ldots, m\}$, and then by generating for each $t$ the column index $j_t$ by sampling according to the distribution $\{p_{i_t j} \mid j = 1, \ldots, n\}$. There are also alternative schemes, in which we first sample rows of $D$ and then generate rows of $A$, along the lines discussed in Section 6.4.2 (see also Exercise 6.14).
Given the first $k+1$ samples, we form the matrix $\hat E_k$ and vector $\hat f_k$ given by
\[ \hat E_k = I - \frac{1}{k+1} \sum_{t=0}^{k} \frac{a_{i_t j_t}}{\xi_{i_t}\, p_{i_t j_t}}\, d(i_t)\, \phi(j_t)', \qquad \hat f_k = \frac{1}{k+1} \sum_{t=0}^{k} \frac{1}{\xi_{i_t}}\, d(i_t)\, b_{i_t}, \]
where $d(i)$ is the $i$th column of $D$ and $\phi(j)'$ is the $j$th row of $\Phi$. By using the expressions
\[ E = I - \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}\, d(i)\, \phi(j)', \qquad f = \sum_{i=1}^{m} d(i)\, b_i, \]
and law of large numbers arguments, it can be shown that $\hat E_k \to E$ and $\hat f_k \to f$, similar to the case of projected equations. In particular, we can write
\[ \hat f_k = \sum_{i=1}^{m} \frac{\sum_{t=0}^{k} \delta(i_t = i)}{k+1}\, \frac{1}{\xi_i}\, d(i)\, b_i, \]
and since
\[ \frac{\sum_{t=0}^{k} \delta(i_t = i)}{k+1} \to \xi_i, \]
we have
\[ \hat f_k \to \sum_{i=1}^{m} d(i)\, b_i = D b. \]
Similarly, we can write
\[ \frac{1}{k+1} \sum_{t=0}^{k} \frac{a_{i_t j_t}}{\xi_{i_t}\, p_{i_t j_t}}\, d(i_t)\, \phi(j_t)' = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\sum_{t=0}^{k} \delta(i_t = i, j_t = j)}{k+1}\, \frac{a_{ij}}{\xi_i\, p_{ij}}\, d(i)\, \phi(j)', \]
and since
\[ \frac{\sum_{t=0}^{k} \delta(i_t = i, j_t = j)}{k+1} \to \xi_i\, p_{ij}, \]
we have
\[ \hat E_k \to I - \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}\, d(i)\, \phi(j)' = E. \]
The convergence $\hat E_k \to E$ and $\hat f_k \to f$ implies in turn that $\hat E_k^{-1} \hat f_k$ converges to the solution of the equation $r = D(b + A\Phi r)$. There is also a regression-based version of this method that is suitable for the case where $\hat E_k$ is nearly singular (cf. Section 6.3.4), and an iterative LSPE-type method that works even when $\hat E_k$ is singular [cf. Eq. (6.75)].
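A minimal simulation sketch of this LSTD-type scheme (Python/NumPy; D, Φ, A, b and the sampling distributions below are small hypothetical examples) accumulates $\hat E_k$ and $\hat f_k$ as above and compares $\hat E_k^{-1}\hat f_k$ with the exact solution of $r = D(b + A\Phi r)$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data for r = D (b + A Phi r): m = n = 4 equations, s = 2 aggregate variables
m = n = 4
s = 2
A = 0.5 * rng.dirichlet(np.ones(n), size=m)        # m x n matrix with small row sums
b = rng.normal(size=m)
D = rng.dirichlet(np.ones(m), size=s)              # s x m (rows are distributions)
Phi = rng.dirichlet(np.ones(s), size=n)            # n x s (rows are distributions)

E = np.eye(s) - D @ A @ Phi                        # E = I - D A Phi
f = D @ b                                          # f = D b
r_exact = np.linalg.solve(E, f)

# Row sampling distribution xi over i = 1..m, column sampling distributions p_ij over j = 1..n
xi = np.full(m, 1.0 / m)
P = np.full((m, n), 1.0 / n)

K = 100_000
S = np.zeros((s, s))                               # running sum for the sampled D A Phi term
f_sum = np.zeros(s)
for _ in range(K):
    i = rng.choice(m, p=xi)                        # row index i_t
    j = rng.choice(n, p=P[i])                      # column index j_t
    S += (A[i, j] / (xi[i] * P[i, j])) * np.outer(D[:, i], Phi[j])
    f_sum += D[:, i] * b[i] / xi[i]

E_hat = np.eye(s) - S / K                          # \hat E_k
f_hat = f_sum / K                                  # \hat f_k
print("exact r    :", r_exact)
print("simulated r:", np.linalg.solve(E_hat, f_hat))
```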
The Nonlinear Case

Consider now the case where $T$ is nonlinear and has the contraction property
\[ \|T(x) - T(\bar x)\|_\infty \le \alpha \|x - \bar x\|_\infty, \qquad \forall\ x, \bar x \in \Re^n, \]
where $\alpha$ is a scalar with $0 < \alpha < 1$ and $\|\cdot\|_\infty$ denotes the sup-norm. Furthermore, let the components of the matrices $D$ and $\Phi$ satisfy
\[ \sum_{i=1}^{m} |d_{\ell i}| \le 1, \qquad \forall\ \ell = 1, \ldots, s, \]
and
\[ \sum_{\ell=1}^{s} |\phi_{j\ell}| \le 1, \qquad \forall\ j = 1, \ldots, n. \]
These assumptions imply that $D$ and $\Phi$ are nonexpansive in the sense that
\[ \|Dx\|_\infty \le \|x\|_\infty, \qquad \forall\ x \in \Re^m, \]
\[ \|\Phi y\|_\infty \le \|y\|_\infty, \qquad \forall\ y \in \Re^s, \]
so that $DT\Phi$ is a sup-norm contraction with modulus $\alpha$, and the equation $r = DT(\Phi r)$ has a unique solution, denoted $r^*$.
The ideas underlying the Q-learning algorithm and its analysis (cf. Section 6.5.1) can be extended to provide a simulation-based algorithm for solving the equation $r = DT(\Phi r)$. This algorithm contains as a special case the iterative aggregation algorithm (6.168), as well as other algorithms of interest in DP, such as for example Q-learning and aggregation-type algorithms for stochastic shortest path problems, and for problems involving post-decision states.

As in Q-learning, the starting point of the algorithm is the fixed point iteration
\[ r_{k+1} = DT(\Phi r_k). \]
This iteration is guaranteed to converge to $r^*$, and the same is true for asynchronous versions where only one component of $r$ is updated at each iteration (this is due to the sup-norm contraction property of $DT\Phi$). To obtain a simulation-based approximation of $DT$, we introduce an $s \times m$ matrix $\hat D$ whose rows are $m$-dimensional probability distributions with components $\hat d_{\ell i}$ satisfying
\[ \hat d_{\ell i} > 0 \quad \text{if } d_{\ell i} \ne 0, \qquad \ell = 1, \ldots, s, \quad i = 1, \ldots, m. \]
The $\ell$th component of the vector $DT(\Phi r)$ can be written as an expected value with respect to this distribution:
\[ \sum_{i=1}^{m} d_{\ell i}\, T_i(\Phi r) = \sum_{i=1}^{m} \hat d_{\ell i} \Bigl( \frac{d_{\ell i}}{\hat d_{\ell i}}\, T_i(\Phi r) \Bigr), \tag{6.286} \]
where $T_i$ is the $i$th component of $T$. This expected value is approximated by simulation in the algorithm that follows.
The algorithm generates a sequence of indices $\{\ell_0, \ell_1, \ldots\}$ according to some mechanism that ensures that all indices $\ell = 1, \ldots, s$ are generated infinitely often. Given $\ell_k$, an index $i_k \in \{1, \ldots, m\}$ is generated according to the probabilities $\hat d_{\ell_k i}$, independently of preceding indices. Then the components of $r_k$, denoted $r_k(\ell)$, $\ell = 1, \ldots, s$, are updated using the following iteration:
\[ r_{k+1}(\ell) = \begin{cases} (1 - \gamma_k)\, r_k(\ell) + \gamma_k\, \dfrac{d_{\ell i_k}}{\hat d_{\ell i_k}}\, T_{i_k}(\Phi r_k) & \text{if } \ell = \ell_k, \\ r_k(\ell) & \text{if } \ell \ne \ell_k, \end{cases} \]
where $\gamma_k > 0$ is a stepsize that diminishes to 0 at an appropriate rate. Thus only the $\ell_k$th component of $r_k$ is changed, while all other components are left unchanged. The stepsize could be chosen to be $\gamma_k = 1/n_k$, where as in Section 6.5.1, $n_k$ is the number of times that index $\ell_k$ has been generated within the sequence $\{\ell_0, \ell_1, \ldots\}$ up to time $k$.

The algorithm is similar to and indeed contains as a special case the Q-learning algorithm (6.180)-(6.181). The justification of the algorithm follows closely the one given for Q-learning in Section 6.5.1. Basically, we replace the expected value in the expression (6.286) of the $\ell$th component of $DT$ with a Monte Carlo estimate based on all the samples up to time $k$ that involve $\ell_k$, and we then simplify the hard-to-calculate terms in the resulting method [cf. Eqs. (6.189) and (6.191)]. A rigorous convergence proof requires the theoretical machinery of stochastic approximation algorithms.
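The sketch below (Python/NumPy; the nonlinear mapping T, the matrices D, Φ, and the sampling distributions $\hat d$ are hypothetical toy choices satisfying the stated assumptions) runs this asynchronous stochastic iteration and compares the limit with the fixed point of $r = DT(\Phi r)$ obtained by the deterministic fixed point iteration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy instance of r = D T(Phi r): m = n = 4, s = 2
m = n = 4
s = 2
W = 0.6 * rng.dirichlet(np.ones(n), size=m)       # used inside T; row sums 0.6 < 1
c = rng.normal(size=m)

def T(x):
    # Componentwise nonexpansive nonlinearity composed with a sup-norm contraction:
    # T_i(x) = c_i + sum_j W_ij * min(1, x_j), so ||T(x) - T(y)||_inf <= 0.6 ||x - y||_inf
    return c + W @ np.minimum(1.0, x)

D = rng.dirichlet(np.ones(m), size=s)             # rows sum to 1  => nonexpansive
Phi = rng.dirichlet(np.ones(s), size=n)           # rows sum to 1  => nonexpansive
D_hat = np.full((s, m), 1.0 / m)                  # sampling distributions, positive everywhere

# Reference: deterministic fixed point iteration r <- D T(Phi r)
r_ref = np.zeros(s)
for _ in range(200):
    r_ref = D @ T(Phi @ r_ref)

# Asynchronous simulation-based iteration
r = np.zeros(s)
counts = np.zeros(s)
for k in range(100_000):
    l = k % s                                     # each component visited infinitely often
    i = rng.choice(m, p=D_hat[l])                 # sample i_k according to d_hat_{l, .}
    counts[l] += 1
    gamma = 1.0 / counts[l]
    r[l] = (1 - gamma) * r[l] + gamma * (D[l, i] / D_hat[l, i]) * T(Phi @ r)[i]

print("deterministic fixed point:", r_ref)
print("stochastic iteration     :", r)
```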
6.9 APPROXIMATION IN POLICY SPACE

Our approach so far in this chapter has been to use an approximation architecture for some cost function, differential cost, or Q-factor. Sometimes this is called approximation in value space, to indicate that a cost or value function is being approximated. In an important alternative, called approximation in policy space, we parameterize the set of policies by a vector $r = (r_1, \ldots, r_s)$ and we optimize the cost over this vector. In particular, we consider randomized stationary policies of a given parametric form $\mu_u(i, r)$, where $\mu_u(i, r)$ denotes the probability that control $u$ is applied when the state is $i$. Each value of $r$ defines a randomized stationary policy, which in turn defines the cost of interest as a function of $r$. We then choose $r$ to minimize this cost.

In an important special case of this approach, the parameterization of the policies is indirect, through an approximate cost function. In particular, a cost approximation architecture parameterized by $r$ defines a policy dependent on $r$ via the minimization in Bellman's equation. For example, Q-factor approximations $\tilde Q(i, u, r)$ define a parameterization of policies by letting $\mu_u(i, r) = 1$ for some $u$ that minimizes $\tilde Q(i, u, r)$ over $u \in U(i)$, and $\mu_u(i, r) = 0$ for all other $u$. This parameterization is discontinuous in $r$, but in practice it is smoothed by replacing the minimization operation with a smooth exponential-based approximation; we refer to the literature for the details. Also, in a more abstract and general view of approximation in policy space, rather than parameterizing policies or Q-factors, we can simply parameterize by $r$ the problem data (stage costs and transition probabilities), and optimize the corresponding cost function over $r$. Thus, in this more general formulation, we may aim to select some parameters of a given system to optimize performance.

Once policies are parameterized in some way by a vector $r$, the cost function of the problem, over a finite or infinite horizon, is implicitly parameterized as a vector $\tilde J(r)$. A scalar measure of performance may then be derived from $\tilde J(r)$, e.g., the expected cost starting from a single initial state, or a weighted sum of costs starting from a selected set of states. The method of optimization used may be any one of a number of possible choices, ranging from random search to gradient methods. This method need not relate to DP, although DP calculations may play a significant role in its implementation. Traditionally, gradient-type methods have received most attention within this context, but they often tend to be slow and to have difficulties with local minima. On the other hand, random search methods, such as the cross-entropy method [RuK04], are often very easy to implement and on occasion have proved surprisingly effective (see the literature cited in Section 6.10).

In this section, we will focus on the finite spaces average cost problem and gradient-type methods. Let the cost per stage vector and transition probability matrix be given as functions of $r$: $G(r)$ and $P(r)$, respectively. Assume that the states form a single recurrent class under each $P(r)$, and let $\xi(r)$ be the corresponding steady-state probability vector. We denote by $G_i(r)$, $P_{ij}(r)$, and $\xi_i(r)$ the components of $G(r)$, $P(r)$, and $\xi(r)$, respectively. Each value of $r$ defines an average cost $\eta(r)$, which is common for all initial states (cf. Section 4.2), and the problem is to find
\[ \min_{r \in \Re^s} \eta(r). \]
Assuming that $\eta(r)$ is differentiable with respect to $r$ (something that must be independently verified), one may use a gradient method for this minimization:
\[ r_{k+1} = r_k - \gamma_k \nabla \eta(r_k), \]
where $\gamma_k$ is a positive stepsize. This is known as a policy gradient method.
6.9.1 The Gradient Formula

We will now show that a convenient formula for the gradients $\nabla \eta(r)$ can be obtained by differentiating Bellman's equation
\[ \eta(r) + h_i(r) = G_i(r) + \sum_{j=1}^{n} P_{ij}(r)\, h_j(r), \qquad i = 1, \ldots, n, \tag{6.287} \]
with respect to the components of $r$, where $h_i(r)$ are the differential costs. Taking the partial derivative with respect to $r_m$, we obtain for all $i$ and $m$,
\[ \frac{\partial \eta}{\partial r_m} + \frac{\partial h_i}{\partial r_m} = \frac{\partial G_i}{\partial r_m} + \sum_{j=1}^{n} \frac{\partial P_{ij}}{\partial r_m}\, h_j + \sum_{j=1}^{n} P_{ij}\, \frac{\partial h_j}{\partial r_m}. \]
(In what follows we assume that the partial derivatives with respect to components of $r$ appearing in various equations exist. The argument at which they are evaluated is often suppressed to simplify notation.) By multiplying this equation with $\xi_i(r)$, adding over $i$, and using the fact $\sum_{i=1}^{n} \xi_i(r) = 1$, we obtain
\[ \frac{\partial \eta}{\partial r_m} + \sum_{i=1}^{n} \xi_i\, \frac{\partial h_i}{\partial r_m} = \sum_{i=1}^{n} \xi_i\, \frac{\partial G_i}{\partial r_m} + \sum_{i=1}^{n} \xi_i \sum_{j=1}^{n} \frac{\partial P_{ij}}{\partial r_m}\, h_j + \sum_{i=1}^{n} \xi_i \sum_{j=1}^{n} P_{ij}\, \frac{\partial h_j}{\partial r_m}. \]
The last summation on the right-hand side cancels the last summation on the left-hand side, because from the defining property of the steady-state probabilities, we have
\[ \sum_{i=1}^{n} \xi_i \sum_{j=1}^{n} P_{ij}\, \frac{\partial h_j}{\partial r_m} = \sum_{j=1}^{n} \Bigl( \sum_{i=1}^{n} \xi_i P_{ij} \Bigr) \frac{\partial h_j}{\partial r_m} = \sum_{j=1}^{n} \xi_j\, \frac{\partial h_j}{\partial r_m}. \]
We thus obtain
\[ \frac{\partial \eta(r)}{\partial r_m} = \sum_{i=1}^{n} \xi_i(r) \Bigl( \frac{\partial G_i(r)}{\partial r_m} + \sum_{j=1}^{n} \frac{\partial P_{ij}(r)}{\partial r_m}\, h_j(r) \Bigr), \qquad m = 1, \ldots, s, \tag{6.288} \]
or in more compact form
\[ \nabla \eta(r) = \sum_{i=1}^{n} \xi_i(r) \Bigl( \nabla G_i(r) + \sum_{j=1}^{n} \nabla P_{ij}(r)\, h_j(r) \Bigr), \tag{6.289} \]
where all the gradients are column vectors of dimension $s$.
6.9.2 Computing the Gradient by Simulation

Despite its relative simplicity, the gradient formula (6.289) involves formidable computations to obtain $\nabla \eta(r)$ at just a single value of $r$. The reason is that neither the steady-state probability vector $\xi(r)$ nor the bias vector $h(r)$ are readily available, so they must be computed or approximated in some way. Furthermore, $h(r)$ is a vector of dimension $n$, so for large $n$, it can only be approximated either through its simulation samples or by using a parametric architecture and an algorithm such as LSPE or LSTD (see the references cited at the end of the chapter).

The possibility to approximate $h$ using a parametric architecture ushers in a connection between approximation in policy space and approximation in value space. It also raises the question whether approximations introduced in the gradient calculation may affect the convergence guarantees of the policy gradient method. Fortunately, however, gradient algorithms tend to be robust and maintain their convergence properties, even in the presence of significant error in the calculation of the gradient.

In the literature, algorithms where both $\mu$ and $h$ are parameterized are sometimes called actor-critic methods. Algorithms where just $\mu$ is parameterized and $h$ is not parameterized but rather estimated explicitly or implicitly by simulation are called actor-only methods, while algorithms where just $h$ is parameterized and $\mu$ is obtained by one-step lookahead minimization are called critic-only methods.
We will now discuss some possibilities of using simulation to approximate $\nabla \eta(r)$. Let us introduce, for all $i$ and $j$ such that $P_{ij}(r) > 0$, the function
\[ L_{ij}(r) = \frac{\nabla P_{ij}(r)}{P_{ij}(r)}. \]
Then, suppressing the dependence on $r$, we write the partial derivative formula (6.289) in the form
\[ \nabla \eta = \sum_{i=1}^{n} \xi_i \Bigl( \nabla G_i + \sum_{j=1}^{n} P_{ij}\, L_{ij}\, h_j \Bigr). \tag{6.290} \]
We assume that for all states $i$ and possible transitions $(i, j)$, we can calculate $\nabla G_i$ and $L_{ij}$. Suppose now that we generate a single infinitely long simulated trajectory $(i_0, i_1, \ldots)$. We can then estimate the average cost $\eta$ as
\[ \tilde \eta = \frac{1}{k} \sum_{t=0}^{k-1} G_{i_t}, \]
where $k$ is large. Then, given an estimate $\tilde \eta$, we can estimate the bias components $h_j$ by using simulation-based approximations to the formula
\[ h_{i_0} = \lim_{N\to\infty} E\Bigl\{ \sum_{t=0}^{N} (G_{i_t} - \eta) \Bigr\} \]
[which holds from general properties of the bias vector when $P(r)$ is aperiodic -- see the discussion following Prop. 4.1.2]. Alternatively, we can estimate $h_j$ by using the LSPE or LSTD algorithms of Section 6.7.1 [note here that if the feature subspace contains the bias vector, the LSPE and LSTD algorithms will find the exact values of $h_j$ in the limit, so with a sufficiently rich set of features, an asymptotically exact calculation of $h_j$, and hence also $\nabla \eta(r)$, is possible]. Finally, given estimates $\tilde \eta$ and $\tilde h_j$, we can estimate the gradient $\nabla\eta$ with a vector $\widetilde{\nabla \eta}$ given by
\[ \widetilde{\nabla \eta} = \frac{1}{k} \sum_{t=0}^{k-1} \bigl( \nabla G_{i_t} + L_{i_t i_{t+1}}\, \tilde h_{i_{t+1}} \bigr). \tag{6.291} \]
This can be seen by a comparison of Eqs. (6.290) and (6.291): if we replace the expected values of $\nabla G_i$ and $L_{ij}$ by empirical averages, and we replace $h_j$ by $\tilde h_j$, we obtain the estimate $\widetilde{\nabla \eta}$.
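For concreteness, the sketch below (Python/NumPy) applies this estimation scheme to a small hypothetical problem with two states and a single scalar parameter r, in which $\nabla G_i(r)$ and $L_{ij}(r)$ are available in closed form from a sigmoid-style parameterization (my choice, not the text's). It estimates $\eta$, the bias components $h_j$ via truncated sums along a single trajectory, forms the gradient estimate of Eq. (6.291), and compares with the exact value of Eq. (6.289). Note that any constant shift in the bias estimates does not affect the gradient, since the rows of $\nabla P(r)$ sum to zero.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def model(r):
    """Hypothetical single-parameter family: P(r), dP/dr, G(r), dG/dr (n = 2 states)."""
    p = sigmoid(r)
    P = np.array([[1 - p, p], [0.4, 0.6]])
    dP = np.array([[-p * (1 - p), p * (1 - p)], [0.0, 0.0]])
    G = np.array([1.0 + 0.5 * r, 3.0])
    dG = np.array([0.5, 0.0])
    return P, dP, G, dG

def exact_gradient(r):
    """Evaluate Eq. (6.289) exactly for this 2-state example."""
    P, dP, G, dG = model(r)
    A_lin = np.vstack([(P.T - np.eye(2))[:-1], np.ones(2)])
    xi = np.linalg.solve(A_lin, np.array([0.0, 1.0]))      # steady-state distribution
    eta = xi @ G
    h1 = (G[0] - eta) / (1.0 - P[0, 0])                     # differential costs, h[1] = 0
    h = np.array([h1, 0.0])
    grad = sum(xi[i] * (dG[i] + dP[i] @ h) for i in range(2))
    return eta, grad

r = 0.3
P, dP, G, dG = model(r)
K, N = 100_000, 50                                  # trajectory length, bias truncation window
traj = np.empty(K, dtype=int); traj[0] = 0
for t in range(1, K):
    traj[t] = rng.choice(2, p=P[traj[t - 1]])
costs = G[traj]
eta_hat = costs.mean()

# h_j estimated by averaging truncated sums of (G_{i_t} - eta) started at visits to j
h_hat = np.zeros(2)
for j in range(2):
    starts = np.flatnonzero(traj[:K - N] == j)[::10]        # thin the start times for speed
    h_hat[j] = np.mean([np.sum(costs[t:t + N] - eta_hat) for t in starts])

L = np.divide(dP, P, out=np.zeros_like(dP), where=P > 0)    # L_ij = (dP_ij/dr) / P_ij
grad_hat = np.mean(dG[traj[:-1]] + L[traj[:-1], traj[1:]] * h_hat[traj[1:]])   # Eq. (6.291)

eta_ex, grad_ex = exact_gradient(r)
print("exact     eta, d(eta)/dr:", eta_ex, grad_ex)
print("simulated eta, d(eta)/dr:", eta_hat, grad_hat)
```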
The estimation-by-simulation procedure outlined above provides a conceptual starting point for more practical gradient estimation methods. For example, in such methods, the estimation of $\eta$ and $h_j$ may be done simultaneously with the estimation of the gradient via Eq. (6.291), and with a variety of different algorithms. We refer to the literature cited at the end of the chapter.
6.9.3 Essential Features of Critics

We will now develop an alternative (but mathematically equivalent) expression for the gradient $\nabla \eta(r)$ that involves Q-factors instead of differential costs. Let us consider randomized policies where $\mu_u(i, r)$ denotes the probability that control $u$ is applied at state $i$. We assume that $\mu_u(i, r)$ is differentiable with respect to $r$ for each $i$ and $u$. Then the corresponding stage costs and transition probabilities are given by
\[ G_i(r) = \sum_{u \in U(i)} \mu_u(i, r) \sum_{j=1}^{n} p_{ij}(u)\, g(i, u, j), \qquad i = 1, \ldots, n, \]
\[ P_{ij}(r) = \sum_{u \in U(i)} \mu_u(i, r)\, p_{ij}(u), \qquad i, j = 1, \ldots, n. \]
Differentiating these equations with respect to $r$, we obtain
\[ \nabla G_i(r) = \sum_{u \in U(i)} \nabla \mu_u(i, r) \sum_{j=1}^{n} p_{ij}(u)\, g(i, u, j), \tag{6.292} \]
\[ \nabla P_{ij}(r) = \sum_{u \in U(i)} \nabla \mu_u(i, r)\, p_{ij}(u), \qquad i, j = 1, \ldots, n. \tag{6.293} \]
Since $\sum_{u \in U(i)} \mu_u(i, r) = 1$ for all $r$, we have $\sum_{u \in U(i)} \nabla \mu_u(i, r) = 0$, so Eq. (6.292) yields
\[ \nabla G_i(r) = \sum_{u \in U(i)} \nabla \mu_u(i, r) \Bigl( \sum_{j=1}^{n} p_{ij}(u)\, g(i, u, j) - \eta(r) \Bigr). \]
Also, by multiplying with $h_j(r)$ and adding over $j$, Eq. (6.293) yields
\[ \sum_{j=1}^{n} \nabla P_{ij}(r)\, h_j(r) = \sum_{j=1}^{n} \sum_{u \in U(i)} \nabla \mu_u(i, r)\, p_{ij}(u)\, h_j(r). \]
By using the preceding two equations to rewrite the gradient formula (6.289), we obtain
\[
\begin{aligned}
\nabla \eta(r) &= \sum_{i=1}^{n} \xi_i(r) \Bigl( \nabla G_i(r) + \sum_{j=1}^{n} \nabla P_{ij}(r)\, h_j(r) \Bigr) \\
&= \sum_{i=1}^{n} \xi_i(r) \sum_{u \in U(i)} \nabla \mu_u(i, r) \sum_{j=1}^{n} p_{ij}(u) \bigl( g(i, u, j) - \eta(r) + h_j(r) \bigr),
\end{aligned}
\]
and finally
\[ \nabla \eta(r) = \sum_{i=1}^{n} \sum_{u \in U(i)} \xi_i(r)\, \tilde Q(i, u, r)\, \nabla \mu_u(i, r), \tag{6.294} \]
where $\tilde Q(i, u, r)$ are the approximate Q-factors corresponding to $r$:
\[ \tilde Q(i, u, r) = \sum_{j=1}^{n} p_{ij}(u) \bigl( g(i, u, j) - \eta(r) + h_j(r) \bigr). \]
Let us now express the formula (6.294) in a way that is amenable to proper interpretation. In particular, by writing
\[ \nabla \eta(r) = \sum_{i=1}^{n} \sum_{\{u \in U(i) \mid \mu_u(i,r) > 0\}} \xi_i(r)\, \mu_u(i, r)\, \tilde Q(i, u, r)\, \frac{\nabla \mu_u(i, r)}{\mu_u(i, r)}, \]
and by introducing the function
\[ \psi_r(i, u) = \frac{\nabla \mu_u(i, r)}{\mu_u(i, r)}, \]
we obtain
\[ \nabla \eta(r) = \sum_{i=1}^{n} \sum_{\{u \in U(i) \mid \mu_u(i,r) > 0\}} \zeta_r(i, u)\, \tilde Q(i, u, r)\, \psi_r(i, u), \tag{6.295} \]
where $\zeta_r(i, u)$ are the steady-state probabilities of the pairs $(i, u)$ under $r$:
\[ \zeta_r(i, u) = \xi_i(r)\, \mu_u(i, r). \]
Note that for each $(i, u)$, $\psi_r(i, u)$ is a vector of dimension $s$, the dimension of the parameter vector $r$. We denote by $\psi_r^m(i, u)$, $m = 1, \ldots, s$, the components of this vector.

Equation (6.295) can form the basis for policy gradient methods that estimate $\tilde Q(i, u, r)$ by simulation, thereby leading to actor-only algorithms. An alternative, suggested by Konda and Tsitsiklis [KoT99], [KoT03], is to interpret the formula as an inner product, thereby leading to a different set of algorithms.
In particular, for a given $r$, we define the inner product of two real-valued functions $Q_1, Q_2$ of $(i, u)$ by
\[ \langle Q_1, Q_2 \rangle_r = \sum_{i=1}^{n} \sum_{\{u \in U(i) \mid \mu_u(i,r) > 0\}} \zeta_r(i, u)\, Q_1(i, u)\, Q_2(i, u). \]
With this notation, we can rewrite Eq. (6.295) as
\[ \frac{\partial \eta(r)}{\partial r_m} = \bigl\langle \tilde Q(\cdot, \cdot, r),\, \psi_r^m(\cdot, \cdot) \bigr\rangle_r, \qquad m = 1, \ldots, s. \]
An important observation is that although $\nabla \eta(r)$ depends on $\tilde Q(i, u, r)$, which has a number of components equal to the number of state-control pairs $(i, u)$, the dependence is only through its inner products with the $s$ functions $\psi_r^m(\cdot, \cdot)$, $m = 1, \ldots, s$.

Now let $\|\cdot\|_r$ be the norm induced by this inner product, i.e.,
\[ \|Q\|_r^2 = \langle Q, Q \rangle_r. \]
Let also $S_r$ be the subspace that is spanned by the functions $\psi_r^m(\cdot, \cdot)$, $m = 1, \ldots, s$, and let $\Pi_r$ denote projection with respect to this norm onto $S_r$. Since
\[ \bigl\langle \tilde Q(\cdot, \cdot, r),\, \psi_r^m(\cdot, \cdot) \bigr\rangle_r = \bigl\langle \Pi_r \tilde Q(\cdot, \cdot, r),\, \psi_r^m(\cdot, \cdot) \bigr\rangle_r, \qquad m = 1, \ldots, s, \]
it is sufficient to know the projection of $\tilde Q(\cdot, \cdot, r)$ onto $S_r$ in order to compute $\nabla \eta(r)$. Thus $S_r$ defines a subspace of essential features, i.e., features the knowledge of which is essential for the calculation of the gradient $\nabla \eta(r)$. As discussed in Section 6.1, the projection of $\tilde Q(\cdot, \cdot, r)$ onto $S_r$ can be done in an approximate sense with TD($\lambda$), LSPE($\lambda$), or LSTD($\lambda$) for $\lambda \approx 1$. We refer to the papers by Konda and Tsitsiklis [KoT99], [KoT03], and Sutton, McAllester, Singh, and Mansour [SMS99] for further discussion.
6.9.4 Approximations in Policy and Value Space

Let us now provide a comparative assessment of approximation in policy and value space. We first note that in comparing approaches, one must bear in mind that specific problems may admit natural parametrizations that favor one type of approximation over the other. For example, in inventory control problems, it is natural to consider policy parametrizations that resemble the (s, S) policies that are optimal for special cases, but also make intuitive sense in a broader context.

Policy gradient methods for approximation in policy space are supported by interesting theory and aim directly at finding an optimal policy within the given parametric class (as opposed to aiming for policy evaluation in the context of an approximate policy iteration scheme). However, they suffer from a drawback that is well-known to practitioners of nonlinear optimization: slow convergence, which unless improved through the use of effective scaling of the gradient (with an appropriate diagonal or nondiagonal matrix), all too often leads to jamming (no visible progress) and complete breakdown. Unfortunately, there has been no proposal of a demonstrably effective scheme to scale the gradient in policy gradient methods (see, however, Kakade [Kak02] for an interesting attempt to address this issue, based on the work of Amari [Ama98]). Furthermore, the performance and reliability of policy gradient methods are susceptible to degradation by large variance of simulation noise. Thus, while policy gradient methods are supported by convergence guarantees in theory, attaining convergence in practice is often challenging. In addition, gradient methods have a generic difficulty with local minima, the consequences of which are not well-understood at present in the context of approximation in policy space.

A major difficulty for approximation in value space is that a good choice of basis functions/features is often far from evident. Furthermore, even when good features are available, the indirect approach of TD(λ), LSPE(λ), and LSTD(λ) may neither yield the best possible approximation of the cost function or the Q-factors of a policy within the feature subspace, nor yield the best possible performance of the associated one-step-lookahead policy. In the case of a fixed policy, LSTD(λ) and LSPE(λ) are quite reliable algorithms, in the sense that they ordinarily achieve their theoretical guarantees in approximating the associated cost function or Q-factors: they involve solution of systems of linear equations, simulation (with convergence governed by the law of large numbers), and contraction iterations (with favorable contraction modulus when λ is not too close to 0). However, within the multiple policy context of an approximate policy iteration scheme, TD methods have additional difficulties: the need for adequate exploration, the issue of policy oscillation and the related chattering phenomenon, and the lack of convergence guarantees for both optimistic and nonoptimistic schemes. When an aggregation method is used for policy evaluation, these difficulties do not arise, but the cost approximation vectors Φr are restricted by the requirement that the rows of Φ must be aggregation probability distributions.
6.10 NOTES, SOURCES, AND EXERCISES
There has been intensive interest in simulation-based methods for approximate DP since the early 90s, in view of their promise to address the dual curses of DP: the curse of dimensionality (the explosion of the computation needed to solve the problem as the number of states increases), and the curse of modeling (the need for an exact model of the system's dynamics).
We have used the name approximate dynamic programming to collectively
refer to these methods. Two other popular names are reinforcement learn-
ing and neuro-dynamic programming. The latter name, adopted by Bert-
sekas and Tsitsiklis [BeT96], comes from the strong connections with DP
as well as with methods traditionally developed in the field of neural net-
works, such as the training of approximation architectures using empirical
or simulation data.
Two books were written on the subject in the mid-90s, one by Sutton
and Barto [SuB98], which reflects an artificial intelligence viewpoint, and
another by Bertsekas and Tsitsiklis [BeT96], which is more mathematical
and reflects an optimal control/operations research viewpoint. We refer
to the latter book for a broader discussion of some of the topics of this
chapter [including rigorous convergence proofs of TD(λ) and Q-learning],
for related material on approximation architectures, batch and incremental
gradient methods, and neural network training, as well as for an extensive
overview of the history and bibliography of the subject up to 1996. More
recent books are Cao [Cao07], which emphasizes a sensitivity approach and
policy gradient methods, Chang, Fu, Hu, and Marcus [CFH07], which em-
phasizes finite-horizon/limited lookahead schemes and adaptive sampling,
Gosavi [Gos03], which emphasizes simulation-based optimization and rein-
forcement learning algorithms, Powell [Pow07], which emphasizes resource
allocation and the difficulties associated with large control spaces, and
Busoniu et al. [BBD10], which focuses on function approximation methods
for continuous space systems. The book by Haykin [Hay08] discusses ap-
proximate DP within the broader context of neural networks and learning.
The book by Borkar [Bor08] is an advanced monograph that addresses rig-
orously many of the convergence issues of iterative stochastic algorithms
in approximate DP, mainly using the so called ODE approach (see also
Borkar and Meyn [BoM00]). The book by Meyn [Mey07] is broader in its
coverage, but touches upon some of the approximate DP algorithms that
we have discussed.
Several survey papers in the volume by Si, Barto, Powell, and Wun-
sch [SBP04], and the special issue by Lewis, Liu, and Lendaris [LLL08]
describe recent work and approximation methodology that we have not
covered in this chapter: linear programming-based approaches (De Farias
and Van Roy [DFV03], [DFV04a], De Farias [DeF04]), large-scale resource
allocation methods (Powell and Van Roy [PoV04]), and deterministic op-
timal control approaches (Ferrari and Stengel [FeS04], and Si, Yang, and
Liu [SYL04]). An influential survey was written, from an artificial intelli-
gence/machine learning viewpoint, by Barto, Bradtke, and Singh [BBS95].
Some recent surveys are Borkar [Bor09] (a methodological point of view
that explores connections with other Monte Carlo schemes), Lewis and
Vrabie [LeV09] (a control theory point of view), and Szepesvari [Sze09] (a
machine learning point of view), Bertsekas [Ber10a] (which focuses on roll-
out algorithms for discrete optimization), and Bertsekas [Ber10b] (which
focuses on policy iteration and elaborates on some of the topics of this
chapter). The reader is referred to these sources for a broader survey of
the literature of approximate DP, which is very extensive and cannot be
fully covered here.
Direct approximation methods and the fitted value iteration approach
have been used for nite horizon problems since the early days of DP. They
are conceptually simple and easily implementable, and they are still in wide
use for approximation of either optimal cost functions or Q-factors (see
e.g., Gordon [Gor99], Longstaff and Schwartz [LoS01], Ormoneit and Sen
[OrS02], and Ernst, Geurts, and Wehenkel [EGW06]). The simplifications
mentioned in Section 6.1.4 are part of the folklore of DP. In particular, post-
decision states have sporadically appeared in the literature since the early
days of DP. They were used in an approximate DP context by Van Roy,
Bertsekas, Lee, and Tsitsiklis [VBL97] in the context of inventory control
problems. They have been recognized as an important simplication in
the book by Powell [Pow07], which pays special attention to the difficulties
associated with large control spaces. For a recent application, see Simao
et al. [SDG09].
Temporal differences originated in reinforcement learning, where they
are viewed as a means to encode the error in predicting future costs, which
is associated with an approximation architecture. They were introduced
in the works of Samuel [Sam59], [Sam67] on a checkers-playing program.
The papers by Barto, Sutton, and Anderson [BSA83], and Sutton [Sut88]
proposed the TD(λ) method, on a heuristic basis without a convergence
analysis. The method motivated a lot of research in simulation-based DP,
particularly following an early success with the backgammon playing pro-
gram of Tesauro [Tes92]. The original papers did not discuss mathematical
convergence issues and did not make the connection of TD methods with
the projected equation. Indeed for quite a long time it was not clear which
mathematical problem TD(λ) was aiming to solve! The convergence of
TD(λ) and related methods was considered for discounted problems by sev-
eral authors, including Dayan [Day92], Gurvits, Lin, and Hanson [GLH94],
Jaakkola, Jordan, and Singh [JJS94], Pineda [Pin97], Tsitsiklis and Van
Roy [TsV97], and Van Roy [Van98]. The proof of Tsitsiklis and Van Roy
[TsV97] was based on the contraction property of ΠT (cf. Lemma 6.3.1 and
Prop. 6.3.1), which is the starting point of our analysis of Section 6.3. The
scaled version of TD(0) [cf. Eq. (6.79)] as well as a λ-counterpart were pro-
posed by Choi and Van Roy [ChV06] under the name Fixed Point Kalman
Filter. The books by Bertsekas and Tsitsiklis [BeT96], and Sutton and
Barto [SuB98] contain a lot of material on TD(λ), its variations, and its
use in approximate policy iteration.
Generally, projected equations are the basis for Galerkin methods,
which are popular in scientific computation (see e.g., [Kra72], [Fle84]).
These methods typically do not use Monte Carlo simulation, which is es-
sential for the DP context. However, Galerkin methods apply to a broad
range of problems, far beyond DP, which is in part the motivation for our
discussion of projected equations in more generality in Section 6.8.
The LSTD(λ) algorithm was first proposed by Bradtke and Barto
[BrB96] for λ = 0, and later extended by Boyan [Boy02] for λ > 0. For
λ > 0, the convergence C_k^{(λ)} → C^{(λ)} and d_k^{(λ)} → d^{(λ)} is not as easy to
demonstrate as in the case λ = 0. An analysis of the law-of-large-numbers
convergence issues associated with LSTD for discounted problems was given
by Nedic and Bertsekas [NeB03]. The more general two-Markov chain
sampling context that can be used for exploration-related methods is an-
alyzed by Bertsekas and Yu [BeY09], and by Yu [Yu10a,b], which shows
convergence under the most general conditions. The analysis of [BeY09]
and [Yu10a,b] also extends to simulation-based solution of general pro-
jected equations. The rate of convergence of LSTD was analyzed by Konda
[Kon02], who showed that LSTD has optimal rate of convergence within a
broad class of temporal dierence methods. The regression/regularization
variant of LSTD is due to Wang, Polydorides, and Bertsekas [WPB09].
This work addresses more generally the simulation-based approximate so-
lution of linear systems and least squares problems, and it applies to LSTD
as well as to the minimization of the Bellman equation error as special cases.
The LSPE(λ) algorithm was first proposed for stochastic shortest
path problems by Bertsekas and Ioffe [BeI96], and was applied to a chal-
lenging problem on which TD(λ) failed: learning an optimal strategy to
play the game of tetris (see also Bertsekas and Tsitsiklis [BeT96], Section
8.3). The convergence of the method for discounted problems was given
in [NeB03] (for a diminishing stepsize), and by Bertsekas, Borkar, and
Nedic [BBN04] (for a unit stepsize). In the paper [BeI96] and the book
[BeT96], the LSPE method was related to the λ-policy iteration of Sec-
tion 6.3.9. The paper [BBN04] compared informally LSPE and LSTD for
discounted problems, and suggested that they asymptotically coincide in
the sense described in Section 6.3. Yu and Bertsekas [YuB06b] provided a
mathematical proof of this for both discounted and average cost problems.
The scaled versions of LSPE and the associated convergence analysis were
developed more recently, and within a more general context in Bertsekas
[Ber09b], [Ber11a], which are based on a connection between general pro-
jected equations and variational inequalities. Some related methods were
given by Yao and Liu [YaL08]. The research on policy or Q-factor evalua-
tion methods was of course motivated by their use in approximate policy
iteration schemes. There has been considerable experimentation with such
schemes, see e.g., [BeI96], [BeT96], [SuB98], [LaP03], [JuP07], [BED09].
However, the relative practical advantages of optimistic versus nonopti-
mistic schemes, in conjunction with LSTD, LSPE, and TD(λ), are not yet
clear. The exploration-enhanced versions of LSPE(λ) and LSTD(λ) of Sec-
tion 6.3.6 are new and were developed as alternative implementations of
the λ-policy iteration method [Ber11b].
Policy oscillations and chattering were first described by the author
at an April 1996 workshop on reinforcement learning [Ber96], and were sub-
sequently discussed in Section 6.4.2 of [BeT96]. The size of the oscillations
is bounded by the error bound of Prop. 1.3.6, which is due to [BeT96].
An alternative error bound that is based on the Euclidean norm has been
derived by Munos [Mun03], and by Scherrer [Sch07] who considered the
λ-policy iteration algorithm of Section 6.3.9. Feature scaling and its effect
on LSTD(λ), LSPE(λ), and TD(λ) (Section 6.3.6) was discussed in Bert-
sekas [Ber11a]. The conditions for policy convergence of Section 6.3.8 were
derived in Bertsekas [Ber10b] and [Ber10c].
The exploration scheme with extra transitions (Section 6.3.7) was
given in the paper by Bertsekas and Yu [BeY09], Example 1. The LSTD(λ)
algorithm with exploration and modified temporal differences (Section 6.3.7)
was given by Bertsekas and Yu [BeY07], and a convergence with probability 1
proof was provided under the condition λp_{ij} ≤ \bar{p}_{ij} for all (i, j) in
[BeY09], Prop. 4. The idea of modified temporal differences stems from the
techniques of importance sampling, which have been introduced in various
DP-related contexts by a number of authors: Glynn and Iglehart [GlI89]
(for exact cost evaluation), Precup, Sutton, and Dasgupta [PSD01] [for
TD(λ) with exploration and stochastic shortest path problems], Ahamed,
Borkar, and Juneja [ABJ06] (in adaptive importance sampling schemes
for cost vector estimation without approximation), and Bertsekas and Yu
[BeY07], [BeY09] (in the context of the generalized projected equation
methods of Section 6.8.1).
The λ-policy iteration algorithm discussed in Section 6.3.9 was first
proposed by Bertsekas and Ioffe [BeI96], and it was used originally as the
basis for LSPE and its application to the tetris problem (see also [BeT96],
Sections 2.3.1 and 8.3). The name LSPE was first used in the subse-
quent paper by Nedic and Bertsekas [NeB03] to describe a specic iterative
implementation of the λ-PI method with cost function approximation for
discounted MDP (essentially the implementation developed in [BeI96] and
[BeT96], and used for a tetris case study). The second simulation-based
implementation described in this section, which views the policy evalua-
tion problem in the context of a stopping problem, is new (see Bertsekas
[Ber11b]). The third simulation-based implementation in this section, was
proposed by Thiery and Scherrer [ThS10a], [ThS10b], who proposed var-
ious associated optimistic policy iteration implementations that relate to
both LSPE and LSTD.
The aggregation approach has a long history in scientic computa-
tion and operations research. It was introduced in the simulation-based
approximate DP context, mostly in the form of value iteration; see Singh,
Jaakkola, and Jordan [SJJ94], [SJJ95], Gordon [Gor95], Tsitsiklis and Van
Roy [TsV96], and Van Roy [Van06]. Bounds on the error between the opti-
mal cost-to-go vector J* and the limit of the value iteration method in the
case of hard aggregation are given under various assumptions in [TsV96]
(see also Exercise 6.12 and Section 6.7.4 of [BeT96]). Related error bounds
are given by Munos and Szepesvari [MuS08]. A more recent work that
focuses on hard aggregation is Van Roy [Van06]. The analysis given here,
which follows the lines of Section 6.3.4 of Vol. I and emphasizes the impor-
tance of convergence in approximate policy iteration, is somewhat different
from alternative developments in the literature.
Multistep aggregation does not seem to have been considered in the
literature, but it may have some important practical applications in prob-
lems where multistep lookahead minimizations are feasible. Also asyn-
chronous distributed aggregation has not been discussed earlier. It is worth
emphasizing that while both projected equation and aggregation methods
produce basis function approximations to costs or Q-factors, there is an
important qualitative difference that distinguishes the aggregation-based
policy iteration approach: assuming sufficiently small simulation error, it
is not susceptible to policy oscillation and chattering like the projected
equation or Bellman equation error approaches. The price for this is the
restriction of the type of basis functions that can be used in aggregation.
Q-learning was proposed by Watkins [Wat89], who explained the
essence of the method, but did not provide a rigorous convergence analy-
sis; see also Watkins and Dayan [WaD92]. A convergence proof was given
by Tsitsiklis [Tsi94]. For SSP problems with improper policies, this proof
required the assumption of nonnegative one-stage costs (see also [BeT96],
Prop. 5.6). This assumption was relaxed by Abounadi, Bertsekas, and
Borkar [ABB02], under some conditions and using an alternative line of
proof, based on the so-called ODE approach. The proofs of these references
include the assumption that either the iterates are bounded or other related
restrictions. It was shown by Yu and Bertsekas [YuB11] that the Q-learning
iterates are naturally bounded for SSP problems, even with improper poli-
cies, so the convergence of Q-learning for SSP problems was established
under no more restrictive assumptions than for discounted MDP.
A variant of Q-learning is the method of advantage updating, devel-
oped by Baird [Bai93], [Bai94], [Bai95], and Harmon, Baird, and Klopf
[HBK94]. In this method, instead of aiming to compute Q(i, u), we compute
A(i, u) = Q(i, u) - \min_{u \in U(i)} Q(i, u).
The function A(i, u) can serve just as well as Q(i, u) for the purpose of computing
corresponding policies, based on the minimization \min_{u \in U(i)} A(i, u),
but may have a much smaller range of values than Q(i, u), which may be
helpful in contexts involving basis function approximation. When using a
lookup table representation, advantage updating is essentially equivalent to
Q-learning, and has the same type of convergence properties. With func-
tion approximation, the convergence properties of advantage updating are
not well-understood (similar to Q-learning). We refer to the book [BeT96],
Section 6.6.2, for more details and some analysis.
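As a simple lookup table illustration of this point, the following Python sketch (the Q-factor data are hypothetical) computes the advantages A(i, u) and checks that the greedy policy is unchanged, while the range of values shrinks:

```python
import numpy as np

# Hypothetical Q-factor table: Q[i, u] is the Q-factor of state i and
# control u (lookup table representation).
Q = np.array([[4.0, 2.5, 3.0],
              [1.0, 1.2, 0.9],
              [7.5, 8.0, 6.0]])

# Advantages: A(i, u) = Q(i, u) - min_u Q(i, u).
A = Q - Q.min(axis=1, keepdims=True)

# Subtracting a state-dependent constant does not change the minimizing
# control at each state, so the greedy policy is the same.
assert np.array_equal(Q.argmin(axis=1), A.argmin(axis=1))

print("advantages:\n", A)
print("range of Q:", Q.max() - Q.min(), " range of A:", A.max() - A.min())
```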
Another variant of Q-learning, also motivated by the fact that we
are really interested in Q-factor differences rather than Q-factors, has been
discussed in Section 6.4.2 of Vol. I, and is aimed at variance reduction of
Q-factors obtained by simulation. A related variant of approximate policy
iteration and Q-learning, called differential training, has been proposed by
the author in [Ber97] (see also Weaver and Baxter [WeB99]). It aims to
compute Q-factor differences in the spirit of the variance reduction ideas
of Section 6.4.2 of Vol. I.
Approximation methods for the optimal stopping problem (Section
6.5.3) were investigated by Tsitsiklis and Van Roy [TsV99b], [Van98], who
noted that Q-learning with a linear parametric architecture could be ap-
plied because the associated mapping F is a contraction with respect to
the norm ‖·‖_ξ. They proved the convergence of a corresponding Q-learning
method, and they applied it to a problem of pricing financial derivatives.
The LSPE algorithm given in Section 6.5.3 for this problem is due to Yu
and Bertsekas [YuB07], to which we refer for additional analysis. An alter-
native algorithm with some similarity to LSPE as well as TD(0) is given
by Choi and Van Roy [ChV06], and is also applied to the optimal stopping
problem. We note that approximate dynamic programming and simulation
methods for stopping problems have become popular in the nance area,
within the context of pricing options; see Longstaff and Schwartz [LoS01],
who consider a finite horizon model in the spirit of Section 6.5.4, and Tsit-
siklis and Van Roy [TsV01], and Li, Szepesvari, and Schuurmans [LSS09],
whose works relate to the LSPE method of Section 6.5.3. The constrained
policy iteration method of Section 6.5.3 is closely related to the paper by
Bertsekas and Yu [BeY10a].
Recently, an approach to Q-learning with exploration, called enhanced
policy iteration, has been proposed (Bertsekas and Yu [BeY10a]). Instead
of policy evaluation by solving a linear system of equations, this method
requires (possibly inexact) solution of Bellman's equation for an optimal
stopping problem. It is based on replacing the standard Q-learning mapping
used for evaluation of a policy μ with the mapping
(F_{J,\nu} Q)(i, u) = \sum_{j=1}^{n} p_{ij}(u) \Big( g(i, u, j) + \alpha \sum_{v \in U(j)} \nu(v \mid j) \min\big\{ J(j), Q(j, v) \big\} \Big),
which depends on a vector J \in \Re^n, with components denoted J(i), and
on a randomized policy ν, which for each state i defines a probability
distribution
\big\{ \nu(u \mid i) \mid u \in U(i) \big\}
over the feasible controls at i, and may depend on the current policy μ.
The vector J is updated using the equation J(i) = \min_{u \in U(i)} Q(i, u), and
the current policy is obtained from this minimization. Finding a fixed
point of the mapping F_{J,ν} is an optimal stopping problem [note the similarity with
the constrained policy iteration (6.209)-(6.210)]. The policy ν may be chosen
arbitrarily at each iteration. It encodes aspects of the current policy
μ, but allows for an arbitrary and easily controllable amount of exploration.
For extreme choices of ν and a lookup table representation, the algorithms
of [BeY10a] yield as special cases the classical Q-learning/value iteration
and policy iteration methods. Together with linear cost/Q-factor approx-
imation, the algorithms may be combined with the TD(0)-like method of
Tsitsiklis and Van Roy [TsV99b], which can be used to solve the associ-
ated stopping problems with low overhead per iteration, thereby resolving
the issue of exploration. Reference [BeY10a] also provides optimistic asyn-
chronous policy iteration versions of Q-learning, which have guaranteed
convergence properties and lower overhead per iteration over the classical
Q-learning algorithm.
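As an illustration of the structure of this mapping (as written above), the following Python sketch applies F_{J,ν} repeatedly to randomly generated data; the transition probabilities, costs, discount factor, and randomized policy ν are all hypothetical, and the sketch is not meant to reproduce the algorithms of [BeY10a]:

```python
import numpy as np

def F_J_nu(Q, J, p, g, nu, alpha):
    """(F_{J,nu} Q)(i,u) = sum_j p[i,u,j]*( g[i,u,j]
       + alpha * sum_v nu[j,v] * min(J[j], Q[j,v]) )."""
    # inner[j] = sum_v nu(v|j) * min{ J(j), Q(j,v) }
    inner = np.sum(nu * np.minimum(J[:, None], Q), axis=1)
    return np.einsum('iuj,iuj->iu', p, g + alpha * inner[None, None, :])

rng = np.random.default_rng(0)
n, m, alpha = 4, 3, 0.9
p = rng.random((n, m, n)); p /= p.sum(axis=2, keepdims=True)   # transition probs
g = rng.random((n, m, n))                                      # stage costs
nu = rng.random((n, m)); nu /= nu.sum(axis=1, keepdims=True)   # randomized policy nu(v|j)
Q = np.zeros((n, m))
J = Q.min(axis=1)                                              # J(i) = min_u Q(i,u)

# A few iterations of Q <- F_{J,nu} Q, with J updated as in the text.
for _ in range(50):
    Q = F_J_nu(Q, J, p, g, nu, alpha)
    J = Q.min(axis=1)
print(J)
```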
The contraction mapping analysis (Prop. 6.6.1) for SSP problems
in Section 6.6 is based on the convergence analysis for TD(λ) given in
Bertsekas and Tsitsiklis [BeT96], Section 6.3.4. The LSPE algorithm was
first proposed for SSP problems in [BeI96] as an implementation of the
λ-policy iteration method of Section 6.3.9 (see also [Ber11b]).
The TD(λ) algorithm was extended to the average cost problem, and
its convergence was proved by Tsitsiklis and Van Roy [TsV99a] (see also
[TsV02]). The average cost analysis of LSPE in Section 6.7.1 is due to Yu
and Bertsekas [YuB06b]. An alternative to the LSPE and LSTD algorithms
of Section 6.7.1 is based on the relation between average cost and SSP
problems, and the associated contracting value iteration method discussed
in Section 4.4.1. The idea is to convert the average cost problem into a
parametric form of SSP, which however converges to the correct one as the
gain of the policy is estimated correctly by simulation. The SSP algorithms
of Section 6.6 can then be used with the estimated gain of the policy η_k
replacing the true gain η.
While the convergence analysis of the policy evaluation methods of
Sections 6.3 and 6.6 is based on contraction mapping arguments, a different
line of analysis is necessary for Q-learning algorithms for average cost prob-
lems (as well as for SSP problems where there may exist some improper
policies). The reason is that there may not be an underlying contraction,
so the nonexpansive property of the DP mapping must be used instead. As
a result, the analysis is more complicated, and a different method of proof
has been employed, based on the so-called ODE approach; see Abounadi,
Bertsekas, and Borkar [ABB01], [ABB02], and Borkar and Meyn [BoM00].
In particular, the Q-learning algorithms of Section 6.7.3 were proposed and
analyzed in these references. They are also discussed in the book [BeT96]
(Section 7.1.5). Alternative algorithms of the Q-learning type for average
cost problems were given without convergence proof by Schwartz [Sch93b],
Singh [Sin94], and Mahadevan [Mah96]; see also Gosavi [Gos04].
The framework of Sections 6.8.1-6.8.6 on generalized projected equa-
tion and Bellman error methods is based on Bertsekas and Yu [BeY07],
[BeY09], which also discuss in greater detail multistep methods, and sev-
eral other variants of the methods given here (see also Bertsekas [Ber09b]).
The regression-based method and the confidence interval analysis of Prop.
6.8.1 is due to Wang, Polydorides, and Bertsekas [WPB09]. The material of
Section 6.8.7 on oblique projections and the connections to aggregation/dis-
cretization with a coarse grid is based on unpublished collaboration with H.
Yu. The generalized aggregation methodology of Section 6.8.8 is new in the
form given here, but is motivated by the development of aggregation-based
approximate DP given in Section 6.4.
The paper by Yu and Bertsekas [YuB08] derives error bounds which
apply to generalized projected equations and sharpen the rather conserva-
tive bound
\|J_\mu - \Phi r^*_\lambda\|_\xi \le \frac{1}{\sqrt{1 - \alpha_\lambda^2}}\, \|J_\mu - \Pi J_\mu\|_\xi,        (6.296)
given for discounted DP problems (cf. Prop. 6.3.2) and the bound
\|x^* - \Phi r^*\|_\xi \le \big\| (I - \Pi A)^{-1} \big\|_\xi\, \|x^* - \Pi x^*\|_\xi,
for the general projected equation Φr = Π(AΦr + b) [cf. Eq. (6.241)]. The
bounds of [YuB08] apply also to the case where ΠA is not a contraction and
have the form
\|x^* - \Phi r^*\|_\xi \le B(A, \xi, S)\, \|x^* - \Pi x^*\|_\xi,
where B(A, ξ, S) is a scalar that [contrary to the scalar 1/\sqrt{1 - \alpha_\lambda^2} in Eq.
(6.296)] depends on the approximation subspace S and the structure of
the matrix A. The scalar B(A, ξ, S) involves the spectral radii of some
low-dimensional matrices and may be computed either analytically or by
simulation (in the case where x has large dimension). One of the scalars
B(A, ξ, S) given in [YuB08] involves only the matrices that are computed as
part of the simulation-based calculation of the matrix C_k via Eq. (6.248), so
it is simply obtained as a byproduct of the LSTD and LSPE-type methods
of Section 6.8.1. Among other situations, such bounds can be useful in cases
where the bias ‖Φr^* − Πx^*‖_ξ (the distance between the solution Φr^* of
the projected equation and the best approximation of x^* within S, which
is Πx^*) is very large [cf. the example of Exercise 6.9, mentioned earlier,
where TD(0) produces a very bad solution relative to TD(λ) for λ ≈ 1]. A
value of B(A, ξ, S) that is much larger than 1 strongly suggests a large bias
and motivates corrective measures (e.g., increase λ in the approximate DP
case, changing the subspace S, or changing ξ). Such an inference cannot
be made based on the much less discriminating bound (6.296), even if ΠA is
a contraction with respect to ‖·‖_ξ.
The Bellman equation error approach was initially suggested by Sch-
weitzer and Seidman [ScS85], and simulation-based algorithms based on
this approach were given later by Harmon, Baird, and Klopf [HBK94],
Baird [Bai95], and Bertsekas [Ber95], including the two-sample simulation-
based method for policy evaluation based on minimization of the Bellman
equation error (Section 6.8.5 and Fig. 6.8.2). For some recent develop-
ments, see Ormoneit and Sen [OrS02], Szepesvari and Smart [SzS04], An-
tos, Szepesvari, and Munos [ASM08], Bethke, How, and Ozdaglar [BHO08],
and Scherrer [Sch10].
There is a large literature on policy gradient methods for average cost
problems. The formula for the gradient of the average cost has been given
in dierent forms and within a variety of dierent contexts: see Cao and
Chen [CaC97], Cao and Wan [CaW98], Cao [Cao99], [Cao05], Fu and Hu
[FuH94], Glynn [Gly87], Jaakkola, Singh, and Jordan [JSJ95], L'Ecuyer
[LEc91], and Williams [Wil92]. We follow the derivations of Marbach and
Tsitsiklis [MaT01]. The inner product expression of ∂η(r)/∂r_m was used to
delineate essential features for gradient calculation by Konda and Tsitsiklis
[KoT99], [KoT03], and Sutton, McAllester, Singh, and Mansour [SMS99].
Several implementations of policy gradient methods, some of which
use cost approximations, have been proposed: see Cao [Cao04], Grudic and
Ungar [GrU04], He [He02], He, Fu, and Marcus [HFM05], Kakade [Kak02],
Konda [Kon02], Konda and Borkar [KoB99], Konda and Tsitsiklis [KoT99],
[KoT03], Marbach and Tsitsiklis [MaT01], [MaT03], Sutton, McAllester,
Singh, and Mansour [SMS99], and Williams [Wil92].
Approximation in policy space can also be carried out very simply
by a random search method in the space of policy parameters. There has
been considerable progress in random search methodology, and the cross-
entropy method (see Rubinstein and Kroese [RuK04], [RuK08], de Boer
et al [BKM05]) has gained considerable attention. A noteworthy success
with this method has been attained in learning a high scoring strategy in
the game of tetris (see Szita and Lorinz [SzL06], and Thiery and Scherrer
[ThS09]); surprisingly this method outperformed in terms of scoring perfor-
mance methods based on approximate policy iteration, approximate linear
programming, and policy gradient by more than an order of magnitude
(see the discussion of policy oscillations and chattering in Section 6.3.8).
Other random search algorithms have also been suggested; see Chang, Fu,
Hu, and Marcus [CFH07], Ch. 3. Additionally, statistical inference meth-
ods have been adapted for approximation in policy space in the context of
some special applications, with the policy parameters viewed as the param-
eters in a corresponding inference problem; see Attias [Att03], Toussaint
and Storey [ToS06], and Verma and Rao [VeR06].
Approximate DP methods for partially observed Markov decision
problems (POMDP) are not as well-developed as their perfect observa-
tion counterparts. Approximations obtained by aggregation/interpolation
schemes and solution of nite-spaces discounted or average cost problems
have been proposed by Zhang and Liu [ZhL97], Zhou and Hansen [ZhH01],
and Yu and Bertsekas [YuB04] (see Example 6.4.1); see also Zhou, Fu,
and Marcus [ZFM10]. Alternative approximation schemes based on finite-
state controllers are analyzed in Hauskrecht [Hau00], Poupart and Boutilier
[PoB04], and Yu and Bertsekas [YuB06a]. Policy gradient methods of the
actor-only type have been given by Baxter and Bartlett [BaB01], and Ab-
erdeen and Baxter [AbB00]. An alternative method, which is of the actor-
critic type, has been proposed by Yu [Yu05]. See also Singh, Jaakkola, and
Jordan [SJJ94], and Moazzez-Estanjini, Li, and Paschalidis [ELP09].
Many problems have special structure, which can be exploited in ap-
proximate DP. For some representative work, see Guestrin et al. [GKP03],
and Koller and Parr [KoP00].
E X E R C I S E S
6.1
Consider a fully connected network with n nodes, and the problem of finding a
travel strategy that takes a traveller from node 1 to node n in no more than a
given number m of time periods, while minimizing the expected travel cost (sum
of the travel costs of the arcs on the travel path). The cost of traversing an arc
changes randomly and independently at each time period with given distribution.
For any node i, the current cost of traversing the outgoing arcs (i, j), j ≠ i, will
become known to the traveller upon reaching i, who will then either choose the
next node j on the travel path, or stay at i (waiting for smaller costs of outgoing
arcs at the next time period) at a fixed (deterministic) cost per period. Derive
a DP algorithm in a space of post-decision variables and compare it to ordinary
DP.
6.2 (Multiple State Visits in Monte Carlo Simulation)
Argue that the Monte Carlo simulation formula
J(i) = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} c(i, m)
is valid even if a state may be revisited within the same sample trajectory. Note:
If only a finite number of trajectories is generated, in which case the number
M of cost samples collected for a given state i is finite and random, the sum
\frac{1}{M} \sum_{m=1}^{M} c(i, m) need not be an unbiased estimator of J(i). However, as the
number of trajectories increases to infinity, the bias disappears. See [BeT96],
Sections 5.1, 5.2, for a discussion and examples. Hint: Suppose the M cost
samples are generated from N trajectories, and that the kth trajectory involves
n_k visits to state i and generates n_k corresponding cost samples. Denote m_k =
n_1 + \cdots + n_k. Write
\lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} c(i, m)
= \lim_{N \to \infty} \frac{ \frac{1}{N} \sum_{k=1}^{N} \sum_{m=m_{k-1}+1}^{m_k} c(i, m) }{ \frac{1}{N} (n_1 + \cdots + n_N) }
= \frac{ E\big\{ \sum_{m=m_{k-1}+1}^{m_k} c(i, m) \big\} }{ E\{n_k\} },
and argue that
E\Big\{ \sum_{m=m_{k-1}+1}^{m_k} c(i, m) \Big\} = E\{n_k\}\, J(i)
(or see Ross [Ros83b], Cor. 7.2.3 for a closely related result).
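A small simulation sketch of the renewal argument in the hint, for an assumed single-state example in which state 1 is revisited a geometrically distributed number of times before termination:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-policy SSP: from state 1, pay cost 1, stay at 1 w.p. 0.4,
# terminate w.p. 0.6.  Exact cost: J(1) = 1 / 0.6.
p_stay, cost = 0.4, 1.0
J_exact = cost / (1.0 - p_stay)

samples = []          # all cost samples c(1, m), across all trajectories
for _ in range(20000):
    costs = []        # stage costs of the current trajectory
    while True:
        costs.append(cost)
        if rng.random() >= p_stay:
            break
    # each visit to state 1 contributes one cost sample: the cost-to-go
    # from that visit to termination (the tail sums of the trajectory cost)
    tail = np.cumsum(costs[::-1])[::-1]
    samples.extend(tail)

print("Monte Carlo estimate:", np.mean(samples), " exact:", J_exact)
```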
6.3 (Viewing Q-Factors as Optimal Costs)
Consider the stochastic shortest path problem under Assumptions 2.1.1 and 2.1.2.
Show that the Q-factors Q(i, u) can be viewed as state costs associated with a
modified stochastic shortest path problem. Use this fact to show that the Q-
factors Q(i, u) are the unique solution of the system of equations
Q(i, u) = \sum_{j} p_{ij}(u) \Big( g(i, u, j) + \min_{v \in U(j)} Q(j, v) \Big).
Hint: Introduce a new state for each pair (i, u), with transition probabilities
pij(u) to the states j = 1, . . . , n, t.
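A numerical sketch of the system of equations above for hypothetical SSP data: fixed point iteration on the Q-factors, cross-checked against the optimality equation for J*(i) = min_u Q(i, u):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2                      # states 0..n-1 plus a termination state (index n)

# Hypothetical SSP data: p[i,u,j] for j = 0..n (last index = termination),
# with enough probability of termination that all policies are proper.
p = rng.random((n, m, n + 1)) + 0.2
p[:, :, n] += 1.0                # boost the termination probability
p /= p.sum(axis=2, keepdims=True)
g = rng.random((n, m, n + 1))    # stage costs g(i,u,j)

def backup(Q):
    # Q(i,u) = sum_j p_ij(u) * ( g(i,u,j) + min_v Q(j,v) ), with Q = 0 at termination
    Jmin = np.append(Q.min(axis=1), 0.0)
    return np.einsum('iuj,iuj->iu', p, g + Jmin[None, None, :])

Q = np.zeros((n, m))
for _ in range(2000):
    Q = backup(Q)

# Cross-check: J*(i) = min_u Q(i,u) should satisfy the SSP optimality equation.
J = Q.min(axis=1)
J_check = np.min(np.einsum('iuj,iuj->iu', p, g + np.append(J, 0.0)[None, None, :]), axis=1)
print(np.max(np.abs(J - J_check)))    # should be ~0
```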
6.4
This exercise provides a counterexample to the convergence of PVI for discounted
problems when the projection is with respect to a norm other than ‖·‖_ξ. Consider
the mapping TJ = g + αPJ and the algorithm Φr_{k+1} = ΠT(Φr_k), where P and Φ
satisfy Assumptions 6.3.1 and 6.3.2. Here Π denotes projection on the subspace
spanned by Φ with respect to the weighted Euclidean norm ‖J‖_V = \sqrt{J'VJ},
where V is a diagonal matrix with positive components. Use the formula
Π = Φ(Φ'VΦ)^{-1}Φ'V to show that in the single basis function case (Φ is an n × 1
vector) the algorithm is written as
r_{k+1} = \frac{\Phi' V g}{\Phi' V \Phi} + \alpha\, \frac{\Phi' V P \Phi}{\Phi' V \Phi}\, r_k.
Construct choices of α, g, P, Φ, and V for which the algorithm diverges.
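One concrete choice for which the scalar recursion above diverges (an assumed example, in the spirit of the exercise): a two-state irreducible and aperiodic chain, a single basis function Φ = (1, 2)', V = I (so the projection norm differs from the steady-state norm ‖·‖_ξ), g = 0, and α = 0.95:

```python
import numpy as np

alpha, eps = 0.95, 0.01
P = np.array([[eps, 1 - eps],
              [eps, 1 - eps]])          # irreducible, aperiodic
phi = np.array([1.0, 2.0])              # single basis function
V = np.eye(2)                           # projection weights NOT the steady-state weights
g = np.array([0.0, 0.0])

# Scalar iteration r_{k+1} = phi'Vg/phi'Vphi + alpha * phi'VPphi/phi'Vphi * r_k
a = float(phi @ V @ g) / float(phi @ V @ phi)
b = alpha * float(phi @ V @ P @ phi) / float(phi @ V @ phi)
print("iteration coefficient:", b)       # about 1.13 > 1, so the iteration diverges

r = 1.0
for _ in range(100):
    r = a + b * r
print("r after 100 iterations:", r)
```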
6.5 (LSPE(0) for Average Cost Problems [YuB06b])
Show the convergence of LSPE(0) for average cost problems with unit stepsize,
assuming that P is aperiodic, by showing that the eigenvalues of the matrix F
lie strictly within the unit circle.
6.6 (Relation of Discounted and Average Cost Approximations
[TsV02])
Consider the finite-state α-discounted and average cost frameworks of Sections
6.3 and 6.7 for a fixed stationary policy with cost per stage g and transition
probability matrix P. Assume that the states form a single recurrent class, let
J_α be the α-discounted cost vector, let (η^*, h^*) be the gain-bias pair, let ξ be
the steady-state probability vector, let Ξ be the diagonal matrix with diagonal
elements the components of ξ, and let
P^* = \lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} P^k.
Show that:
(a) η^* = (1 − α) ξ' J_α and P^* J_α = (1 − α)^{-1} η^* e.
(b) h^* = \lim_{α→1} (I − P^*) J_α. Hint: Use the Laurent series expansion of J_α (cf.
Prop. 4.1.2).
(c) Consider the subspace
E^* = \big\{ (I - P^*) y \mid y \in \Re^n \big\},
which is orthogonal to the unit vector e in the scaled geometry where x
and y are orthogonal if x' Ξ y = 0 (cf. Fig. 6.7.1). Verify that J_α can be
decomposed into the sum of two vectors that are orthogonal (in the scaled
geometry): P^* J_α, which is the projection of J_α onto the line defined by e,
and (I − P^*) J_α, which is the projection of J_α onto E^* and converges to h^*
as α → 1.
(d) Use part (c) to show that the limit r^*_{λ,α} of PVI(λ) for the α-discounted
problem converges to the limit r^*_λ of PVI(λ) for the average cost problem
as α → 1.
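A quick numerical check of the two relations in part (a) on randomly generated data (hypothetical chain and costs):

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 5, 0.9
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)        # irreducible, aperiodic chain
g = rng.random(n)

# Discounted cost J_alpha, steady-state distribution xi, gain eta*.
J_alpha = np.linalg.solve(np.eye(n) - alpha * P, g)
w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmin(np.abs(w - 1.0))]); xi /= xi.sum()
eta = xi @ g                              # average cost per stage

print(eta, (1 - alpha) * (xi @ J_alpha))                    # first relation of part (a)
Pstar = np.outer(np.ones(n), xi)                            # P* = e xi' (single recurrent class)
print(np.max(np.abs(Pstar @ J_alpha - eta / (1 - alpha))))  # second relation of part (a)
```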
6.7 (Conversion of SSP to Average Cost Policy Evaluation)
We have often used the transformation of an average cost problem to an SSP
problem (cf. Section 4.3.1, and Chapter 7 of Vol. I). The purpose of this exercise
(unpublished collaboration of H. Yu and the author) is to show that a reverse
transformation is possible, from SSP to average cost, at least in the case where
all policies are proper. As a result, analysis, insights, and algorithms for average
cost policy evaluation can be applied to policy evaluation of a SSP problem.
Consider the SSP problem, a single proper stationary policy μ, and the
probability distribution q_0 = (q_0(1), \ldots, q_0(n)) used for restarting simulated
trajectories [cf. Eq. (6.215)]. Let us modify the Markov chain by eliminating the
self-transition from state 0 to itself, and substituting instead transitions from 0
to i with probabilities q_0(i),
p_{0i} = q_0(i),
each with a fixed transition cost β, where β is a scalar parameter. All other
transitions and costs remain the same (cf. Fig. 6.10.1). We call the corresponding
average cost problem β-AC. Denote by J the SSP cost vector of μ, and by η_β
and h_β(i) the average and differential costs of β-AC, respectively.
(a) Show that η_β can be expressed as the average cost per stage of the cycle
that starts at state 0 and returns to 0, i.e.,
\eta_\beta = \frac{\beta + \sum_{i=1}^{n} q_0(i) J(i)}{T},
where T is the expected time to return to 0 starting from 0.
(b) Show that for the special value
\beta^* = - \sum_{i=1}^{n} q_0(i) J(i),
[Figure 6.10.1: Transformation of a SSP problem (left) to an average cost problem (right). The self-transition p_{00} = 1 at state 0 is replaced by transitions from 0 to each i = 1, \ldots, n with probability q_0(i) and cost β.]
we have η_{β^*} = 0, and
J(i) = h_{\beta^*}(i) - h_{\beta^*}(0), \qquad i = 1, \ldots, n.
Hint: Since the states of β-AC form a single recurrent class, we have from
Bellman's equation
\eta_\beta + h_\beta(i) = \sum_{j=0}^{n} p_{ij} \big( g(i, j) + h_\beta(j) \big), \qquad i = 1, \ldots, n,        (6.297)
\eta_\beta + h_\beta(0) = \beta + \sum_{i=1}^{n} q_0(i) h_\beta(i).        (6.298)
From Eq. (6.297) it follows that if β = β^*, we have η_{β^*} = 0, and
\delta(i) = \sum_{j=0}^{n} p_{ij} g(i, j) + \sum_{j=1}^{n} p_{ij} \delta(j), \qquad i = 1, \ldots, n,        (6.299)
where
\delta(i) = h_{\beta^*}(i) - h_{\beta^*}(0), \qquad i = 1, \ldots, n.
Since Eq. (6.299) is Bellman's equation for the SSP problem, we see that
δ(i) = J(i) for all i.
(c) Derive a transformation to convert an average cost policy evaluation prob-
lem into another average cost policy evaluation problem where the transi-
tion probabilities out of a single state are modified in any way such that
the states of the resulting Markov chain form a single recurrent class. The
two average cost problems should have the same differential cost vectors,
except for a constant shift. Note: This conversion may be useful if the
transformed problem has more favorable properties.
6.8 (Projected Equations for Finite-Horizon Problems)
Consider a finite-state finite-horizon policy evaluation problem with the cost vector
and transition matrices at time m denoted by g_m and P_m, respectively. The
DP algorithm/Bellman's equation takes the form
J_m = g_m + P_m J_{m+1}, \qquad m = 0, \ldots, N - 1,
where J_m is the cost vector of stage m for the given policy, and J_N is a given
terminal cost vector. Consider a low-dimensional approximation of J_m that has
the form
J_m \approx \Phi_m r_m, \qquad m = 0, \ldots, N - 1,
where Φ_m is a matrix whose columns are basis functions. Consider also a projected
equation of the form
\Phi_m r_m = \Pi_m (g_m + P_m \Phi_{m+1} r_{m+1}), \qquad m = 0, \ldots, N - 1,
where Π_m denotes projection onto the space spanned by the columns of Φ_m with
respect to a weighted Euclidean norm with weight vector ξ_m.
(a) Show that the projected equation can be written in the equivalent form
\Phi_m' \Xi_m (\Phi_m r_m - g_m - P_m \Phi_{m+1} r_{m+1}) = 0, \qquad m = 0, \ldots, N - 2,
\Phi_{N-1}' \Xi_{N-1} (\Phi_{N-1} r_{N-1} - g_{N-1} - P_{N-1} J_N) = 0,
where Ξ_m is the diagonal matrix having the vector ξ_m along the diagonal.
Abbreviated solution: The derivation follows the one of Section 6.3.1 [cf. the
analysis leading to Eqs. (6.40) and (6.41)]. The solution {r^*_0, \ldots, r^*_{N-1}} of
the projected equation is obtained as
r^*_m = \arg\min_{r_0, \ldots, r_{N-1}} \Big\{ \sum_{m=0}^{N-2} \big\| \Phi_m r_m - (g_m + P_m \Phi_{m+1} r^*_{m+1}) \big\|^2_{\xi_m}
+ \big\| \Phi_{N-1} r_{N-1} - (g_{N-1} + P_{N-1} J_N) \big\|^2_{\xi_{N-1}} \Big\}.
The minimization can be decomposed into N minimizations, one for each
m, and by setting to 0 the gradient with respect to rm, we obtain the
desired form.
(b) Consider a simulation scheme that generates a sequence of trajectories of
the system, similar to the case of a stochastic shortest path problem (cf.
Section 6.6). Derive corresponding LSTD and (scaled) LSPE algorithms.
(c) Derive appropriate modifications of the algorithms of Section 6.5.3 to ad-
dress a finite horizon version of the optimal stopping problem.
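For part (a), the equivalent form gives the explicit backward recursion r_m = (Φ_m'Ξ_mΦ_m)^{-1}Φ_m'Ξ_m(g_m + P_mΦ_{m+1}r_{m+1}), with J_N used at the last stage. A sketch of this exact (non-simulation) solution for hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, s, N = 6, 2, 5                         # states, basis functions, horizon

P = [rng.random((n, n)) for _ in range(N)]
for Pm in P:
    Pm /= Pm.sum(axis=1, keepdims=True)
g = [rng.random(n) for _ in range(N)]
Phi = [rng.random((n, s)) for _ in range(N)]
xi = [np.full(n, 1.0 / n) for _ in range(N)]   # projection weights (assumed uniform)
J_N = rng.random(n)                            # terminal cost vector

# Backward solution of the projected equations of part (a).
r = [None] * N
for m in reversed(range(N)):
    Xi = np.diag(xi[m])
    nxt = J_N if m == N - 1 else Phi[m + 1] @ r[m + 1]
    target = g[m] + P[m] @ nxt
    r[m] = np.linalg.solve(Phi[m].T @ Xi @ Phi[m], Phi[m].T @ Xi @ target)

# Compare with the exact DP costs J_m = g_m + P_m J_{m+1} at stage 0.
J = J_N
for m in reversed(range(N)):
    J = g[m] + P[m] @ J
print("exact J_0       :", J)
print("approx Phi_0 r_0:", Phi[0] @ r[0])
```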
6.9 (Approximation Error of TD Methods [Ber95])
This exercise illustrates how the value of λ may significantly affect the approxi-
mation quality in TD methods. Consider a problem of the SSP type, but with a
single policy. The states are 0, 1, . . . , n, with state 0 being the termination state.
Under the given policy, the system moves deterministically from state i ≥ 1 to
state i − 1 at a cost g_i. Consider a linear approximation of the form
\tilde J(i, r) = i\, r
for the cost-to-go function, and the application of TD methods. Let all simulation
runs start at state n and end at 0 after visiting all the states n − 1, n − 2, . . . , 1
in succession.
(a) Derive the corresponding projected equation Φr^*_λ = ΠT^{(λ)}(Φr^*_λ) and show
that its unique solution r^*_λ satisfies
\sum_{k=1}^{n} (g_k - r^*_\lambda) \big( \lambda^{n-k} n + \lambda^{n-k-1} (n-1) + \cdots + k \big) = 0.
(b) Plot \tilde J(i, r^*_\lambda) with λ from 0 to 1 in increments of 0.2, for the following two
cases:
(1) n = 50, g_1 = 1 and g_i = 0 for all i ≠ 1.
(2) n = 50, g_n = −(n − 1) and g_i = 1 for all i ≠ n.
Figure 6.10.2 gives the results for λ = 0 and λ = 1.
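The condition of part (a) gives r^*_λ in closed form, r^*_λ = Σ_k g_k w_k(λ) / Σ_k w_k(λ) with w_k(λ) = λ^{n−k} n + λ^{n−k−1}(n−1) + ... + k. A sketch evaluating it for the two cases of part (b):

```python
import numpy as np

def r_star(g, lam):
    n = len(g)
    # w_k = lam^{n-k} n + lam^{n-k-1} (n-1) + ... + k,  for k = 1,...,n
    w = np.array([sum(lam ** (i - k) * i for i in range(k, n + 1))
                  for k in range(1, n + 1)])
    return float(w @ g) / float(w.sum())

n = 50
cases = {
    "g1 = 1, gi = 0 otherwise": np.array([1.0] + [0.0] * (n - 1)),
    "gn = -(n-1), gi = 1 otherwise": np.array([1.0] * (n - 1) + [-(n - 1.0)]),
}
for name, g in cases.items():
    for lam in (0.0, 1.0):
        r = r_star(g, lam)
        print(f"{name}, lambda = {lam}: r* = {r:.4f}, "
              f"approximation at state n: {n * r:.2f}")
```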
6.11
This exercise provides an example of comparison of the projected equation ap-
proach of Section 6.8.1 and the least squares approach of Section 6.8.2. Consider
the case of a linear system involving a vector x with two block components,
x_1 ∈ ℜ^k and x_2 ∈ ℜ^m. The system has the form
x_1 = A_{11} x_1 + b_1, \qquad x_2 = A_{21} x_1 + A_{22} x_2 + b_2,
so x_1 can be obtained by solving the first equation. Let the approximation
subspace be ℜ^k × S_2, where S_2 is a subspace of ℜ^m. Show that with the projected
equation approach, we obtain the component x^*_1 of the solution of the original
equation, but with the least squares approach, we do not.
6.12 (Error Bounds for Hard Aggregation [TsV96])
Consider the hard aggregation case of Section 6.4.2, and denote i ∈ x if the
original state i belongs to aggregate state x. Also for every i denote by x(i) the
aggregate state x with i ∈ x. Consider the corresponding mapping F defined by
(FR)(x) = \sum_{i=1}^{n} d_{xi} \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \Big( g(i, u, j) + \alpha R\big(x(j)\big) \Big), \qquad x \in A,
[Figure 6.10.2: Form of the cost-to-go function J(i), and the linear representations \tilde J(i, r^*_\lambda) in Exercise 6.9, for the case g_1 = 1, g_i = 0, i ≠ 1 (figure on the left), and the case g_n = −(n − 1), g_i = 1, i ≠ n (figure on the right).]
[cf. Eq. (6.161)], and let R^* be the unique fixed point of this mapping. Show that
R^*(x) - \frac{\epsilon}{1 - \alpha} \le J^*(i) \le R^*(x) + \frac{\epsilon}{1 - \alpha}, \qquad \forall\ x \in A,\ i \in x,
where
\epsilon = \max_{x \in A} \max_{i, j \in x} \big| J^*(i) - J^*(j) \big|.
Abbreviated Proof: Let the vector \bar R be defined by
\bar R(x) = \min_{i \in x} J^*(i) + \frac{\epsilon}{1 - \alpha}, \qquad x \in A.
We have for all x ∈ A,
(F\bar R)(x) = \sum_{i=1}^{n} d_{xi} \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \Big( g(i, u, j) + \alpha \bar R\big(x(j)\big) \Big)
\le \sum_{i=1}^{n} d_{xi} \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \Big( g(i, u, j) + \alpha J^*(j) + \frac{\alpha \epsilon}{1 - \alpha} \Big)
= \sum_{i=1}^{n} d_{xi} \Big( J^*(i) + \frac{\alpha \epsilon}{1 - \alpha} \Big)
\le \min_{i \in x} \big( J^*(i) + \epsilon \big) + \frac{\alpha \epsilon}{1 - \alpha}
= \min_{i \in x} J^*(i) + \frac{\epsilon}{1 - \alpha}
= \bar R(x).
Thus, F\bar R \le \bar R, from which it follows that R^* \le \bar R (since R^* = \lim_{k \to \infty} F^k \bar R and
F is monotone). This proves the left-hand side of the desired inequality. The
right-hand side follows similarly.
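A numerical sanity check of the bound on randomly generated (hypothetical) problem data: J* is computed by value iteration, R* by fixed point iteration of F, and the two sides of the inequality are compared:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, s, alpha = 9, 2, 3, 0.9

p = rng.random((n, m, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((n, m, n))
group = np.repeat(np.arange(s), n // s)            # hard aggregation: x(i)
D = np.zeros((s, n))                               # disaggregation probabilities d_xi
for x in range(s):
    idx = np.where(group == x)[0]
    D[x, idx] = 1.0 / len(idx)

def T(J):   # Bellman operator of the original discounted problem
    return np.min(np.einsum('iuj,iuj->iu', p, g + alpha * J[None, None, :]), axis=1)

def F(R):   # aggregate mapping of the exercise
    return D @ np.min(np.einsum('iuj,iuj->iu', p, g + alpha * R[group][None, None, :]), axis=1)

J, R = np.zeros(n), np.zeros(s)
for _ in range(2000):
    J, R = T(J), F(R)

eps = max(np.ptp(J[group == x]) for x in range(s))
print("max |J*(i) - R*(x(i))| :", np.max(np.abs(J - R[group])))
print("bound eps/(1-alpha)    :", eps / (1 - alpha))
```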
6.13 (Hard Aggregation as a Projected Equation Method)
Consider a fixed point equation of the form
r = DT(\Phi r),
where T : ℜ^n → ℜ^n is a (possibly nonlinear) mapping, and D and Φ are s × n
and n × s matrices, respectively, and Φ has rank s. Writing this equation as
\Phi r = \Phi D T(\Phi r),
we see that it is a projected equation if ΦD is a projection onto the subspace
S = {Φr | r ∈ ℜ^s} with respect to a weighted Euclidean norm. The purpose of
this exercise is to prove that this is the case in hard aggregation schemes, where
the set of indices {1, . . . , n} is partitioned into s disjoint subsets I_1, . . . , I_s and:
(1) The ℓth column of Φ has components that are 1 or 0 depending on whether
they correspond to an index in I_ℓ or not.
(2) The ℓth row of D is a probability distribution (d_{ℓ1}, . . . , d_{ℓn}) whose compo-
nents are positive depending on whether they correspond to an index in I_ℓ
or not, i.e., \sum_{i=1}^{n} d_{\ell i} = 1, d_{\ell i} > 0 if i ∈ I_ℓ, and d_{\ell i} = 0 if i ∉ I_ℓ.
Show in particular that ΦD is given by the projection formula
\Phi D = \Phi (\Phi' \Xi \Phi)^{-1} \Phi' \Xi,
where Ξ is the diagonal matrix with the nonzero components of D along the
diagonal, normalized so that they form a probability distribution, i.e.,
\xi_i = \frac{d_{\ell i}}{\sum_{k=1}^{s} \sum_{j=1}^{n} d_{kj}}, \qquad i \in I_\ell,\ \ell = 1, \ldots, s.
Notes: (1) Based on the preceding result, if T is a contraction mapping with
respect to the projection norm, the same is true for ΦDT. In addition, if T is
a contraction mapping with respect to the sup-norm, the same is true for ΦDT
(since aggregation and disaggregation matrices are nonexpansive with respect to
the sup-norm); this is true for all aggregation schemes, not just hard aggregation.
(2) For ΦD to be a weighted Euclidean projection, we must have ΦDΦD = ΦD.
This implies that if DΦ is invertible and ΦD is a weighted Euclidean projection,
we must have DΦ = I (since if DΦ is invertible, ΦD has rank s, which implies
that DΦDΦ = DΦ and hence DΦ = I, since Φ also has rank s). From this it can
be seen that out of all possible aggregation schemes with DΦ invertible and D
having nonzero columns, only hard aggregation has the projection property of
this exercise.
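A quick numerical check of the projection formula for an assumed hard aggregation pair (D, Φ) with n = 5 and s = 2:

```python
import numpy as np

# Hard aggregation with groups I_1 = {0, 1, 2} and I_2 = {3, 4}.
Phi = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
D = np.array([[0.2, 0.5, 0.3, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.4, 0.6]])

# Weights xi: the nonzero components of D, normalized to a probability distribution.
xi = D.sum(axis=0) / D.sum()
Xi = np.diag(xi)

proj = Phi @ np.linalg.inv(Phi.T @ Xi @ Phi) @ Phi.T @ Xi
print(np.max(np.abs(Phi @ D - proj)))      # ~0: Phi D is the xi-weighted projection onto S
```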
6.14 (Simulation-Based Implementation of Linear Aggregation
Schemes)
Consider a linear system Ax = b, where A is an n × n matrix and b is a column
vector in ℜ^n. In a scheme that generalizes the aggregation approach of Section
6.5, we introduce an n × s matrix Φ, whose columns are viewed as basis functions,
and an s × n matrix D. We find a solution r^* of the s × s system
D A \Phi r = D b,
and we view Φr^* as an approximate solution of Ax = b. An approximate imple-
mentation is to compute by simulation approximations \hat C and \hat d of the matrices
C = DAΦ and d = Db, respectively, and then solve the system \hat C r = \hat d. The
purpose of the exercise is to provide a scheme for doing this.
Let G and \hat A be matrices of dimensions s × n and n × n, respectively, whose
rows are probability distributions, and are such that their components satisfy
G_{\ell i} > 0 \ \text{if}\ D_{\ell i} \ne 0, \qquad \ell = 1, \ldots, s,\ i = 1, \ldots, n,
\hat A_{ij} > 0 \ \text{if}\ A_{ij} \ne 0, \qquad i = 1, \ldots, n,\ j = 1, \ldots, n.
We approximate the (ℓm)th component C_{ℓm} of C as follows. We generate a
sequence {(i_t, j_t) | t = 1, 2, \ldots} by independently generating each i_t according
to the distribution {G_{\ell i} | i = 1, \ldots, n}, and then by independently generating j_t
according to the distribution {\hat A_{i_t j} | j = 1, \ldots, n}. For k = 1, 2, \ldots, consider the
scalar
C^k_{\ell m} = \frac{1}{k} \sum_{t=1}^{k} \frac{D_{\ell i_t}}{G_{\ell i_t}}\, \frac{A_{i_t j_t}}{\hat A_{i_t j_t}}\, \Phi_{j_t m},
where Φ_{jm} denotes the (jm)th component of Φ. Show that with probability 1
we have
\lim_{k \to \infty} C^k_{\ell m} = C_{\ell m}.
Derive a similar scheme for approximating the components of d.
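A simulation sketch of the scheme, as reconstructed above (the sampling matrices G and \hat A below are illustrative choices satisfying the stated positivity conditions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, s = 6, 2

A = rng.random((n, n)) * (rng.random((n, n)) < 0.7)      # system matrix
b = rng.random(n)
Phi = rng.random((n, s))
D = rng.random((s, n))
C = D @ A @ Phi                                           # target: C = D A Phi

# Sampling distributions: G_{li} > 0 wherever D_{li} != 0, Ahat_{ij} > 0 wherever A_{ij} != 0.
G = np.abs(D) + 0.05; G /= G.sum(axis=1, keepdims=True)
Ahat = np.abs(A) + 0.05; Ahat /= Ahat.sum(axis=1, keepdims=True)

l, m, K = 0, 1, 100000
i_t = rng.choice(n, size=K, p=G[l])                       # i_t ~ G_{l.}
j_t = np.array([rng.choice(n, p=Ahat[i]) for i in i_t])   # j_t ~ Ahat_{i_t .}
est = np.mean(D[l, i_t] / G[l, i_t] * A[i_t, j_t] / Ahat[i_t, j_t] * Phi[j_t, m])

print("C[l,m] =", C[l, m], " estimate =", est)
```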
6.15 (Approximate Policy Iteration Using an Approximate
Problem)
Consider the discounted problem of Section 6.3 (referred to as DP) and an ap-
proximation to this problem (this is a different discounted problem referred to
as AP). This exercise considers an approximate policy iteration method where
the policy evaluation is done through AP, but the policy improvement is done
through DP, a process that is patterned after the aggregation-based policy iter-
ation method of Section 6.4.2. In particular, we assume that the two problems,
DP and AP, are connected as follows:
(1) DP and AP have the same state and control spaces, and the same policies.
(2) For any policy μ, its cost vector in AP, denoted \tilde J_\mu, satisfies
\| \tilde J_\mu - J_\mu \|_\infty \le \delta,
i.e., policy evaluation using AP, rather than DP, incurs an error of at most
δ in sup-norm.
(3) The policy \bar\mu obtained by exact policy iteration in AP satisfies the equation
T_{\bar\mu} \tilde J_{\bar\mu} = T \tilde J_{\bar\mu}.
This is true in particular if the policy improvement process in AP is iden-
tical to the one in DP.
Show the error bound
J_{\bar\mu} \le J^* + \frac{2\alpha\delta}{1 - \alpha}\, e
[cf. Eq. (6.134)]. Hint: Follow the derivation of the error bound (6.134).
6.16 (Approximate Policy Iteration and Newton's Method)
Consider the discounted problem, and a policy iteration method that uses func-
tion approximation over a subspace S = {Φr | r ∈ ℜ^s}, and operates as follows:
Given a vector r_k ∈ ℜ^s, define μ_k to be a policy such that
T_{\mu_k}(\Phi r_k) = T(\Phi r_k),        (6.300)
and let r_{k+1} satisfy
r_{k+1} = L_{k+1} T_{\mu_k}(\Phi r_{k+1}),
where L_{k+1} is an s × n matrix. Show that if μ_k is the unique policy satisfying Eq.
(6.300), then r_{k+1} is obtained from r_k by a Newton step for solving the equation
r = L_{k+1} T(\Phi r)
(cf. Exercise 1.10).
6.17 (Projected Equations for Approximation of Cost Function
Differences)
Let x^* be the unique solution of an equation of the form x = b + Ax. Suppose
that we are interested in approximating within a subspace S = {Φr | r ∈ ℜ^s} the
vector y^* = Dx^*, where D is an invertible matrix.
(a) Show that y^* is the unique solution of y = c + By, where
B = D A D^{-1}, \qquad c = D b.
(b) Consider the projected equation approach of approximating y^* by Φr^*,
obtained by solving Φr = Π(c + BΦr) (cf. Section 6.8.1), and the special
case where D is the matrix that maps (x(1), \ldots, x(n)) to the vector
(y(1), \ldots, y(n)), with y consisting of component differences of x: y(i) =
x(i) − x(n), i = 1, \ldots, n − 1, and y(n) = x(n). Calculate D, B, and c, and
develop a simulation-based matrix inversion approach for this case.
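For the special case of part (b), D and D^{-1} can be written down directly. The sketch below builds D, B, and c for hypothetical (A, b) and verifies that y* = Dx* solves y = c + By:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
A = 0.5 * rng.random((n, n)) / n        # row sums < 1, so x = b + Ax has a unique solution
b = rng.random(n)
x_star = np.linalg.solve(np.eye(n) - A, b)

# D maps x to y: y(i) = x(i) - x(n) for i < n, y(n) = x(n).
D = np.eye(n); D[:-1, -1] = -1.0
Dinv = np.eye(n); Dinv[:-1, -1] = 1.0   # inverse map: x(i) = y(i) + y(n), x(n) = y(n)

B = D @ A @ Dinv
c = D @ b
y_star = np.linalg.solve(np.eye(n) - B, c)

print(np.max(np.abs(y_star - D @ x_star)))   # ~0: y* = D x* solves y = c + B y
```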
6.18 (Constrained Projected Equations [Ber09b], [Ber11a])
Consider the projected equation J = \hat\Pi TJ, where the projection \hat\Pi is done on a
closed convex subset \hat S of the approximation subspace S = {Φr | r ∈ ℜ^s} (rather
than on S itself).
(a) Show that the projected equation is equivalent to finding r^* ∈ ℜ^s such that
f(\Phi r^*)' \Phi (r - r^*) \ge 0, \qquad \forall\ r \in \hat R,
where
f(J) = \Xi (J - TJ), \qquad \hat R = \{ r \mid \Phi r \in \hat S \}.
Note: This type of inequality is known as a variational inequality.
(b) Consider the special case where Φ is partitioned as [Φ_1 \bar\Phi], where Φ_1 is the
first column of Φ and \bar\Phi is the n × (s − 1) matrix comprising the remaining
(s − 1) columns. Let \hat S be the affine subset of S given by
\hat S = \{ \beta + \bar\Phi r \mid r \in \Re^{s-1} \},
where β is a fixed multiple of the vector Φ_1. Let TJ = g + AJ, where A is
an n × n matrix. Show that the projected equation is equivalent to finding
r ∈ ℜ^{s−1} that solves the equation \bar C r = \bar d, where
\bar C = \bar\Phi' \Xi (I - A) \bar\Phi, \qquad \bar d = \bar\Phi' \Xi (g + A\beta - \beta).
Derive an LSTD-type algorithm to obtain simulation-based approximations
to \bar C and \bar d, and corresponding approximation to r.
6.19 (Policy Gradient Formulas for SSP)
Consider the SSP context, and let the cost per stage and transition probabil-
ity matrix be given as functions of a parameter vector r. Denote by gi(r),
i = 1, . . . , n, the expected cost starting at state i, and by pij(r) the transi-
tion probabilities. Each value of r denes a stationary policy, which is assumed
proper. For each r, the expected costs starting at states i are denoted by Ji(r).
We wish to calculate the gradient of a weighted sum of the costs J_i(r), i.e.,
\bar J(r) = \sum_{i=1}^{n} q(i) J_i(r),
where q = (q(1), \ldots, q(n)) is some probability distribution over the states. Con-
sider a single scalar component r_m of r, and differentiate Bellman's equation to
show that
\frac{\partial J_i}{\partial r_m} = \frac{\partial g_i}{\partial r_m} + \sum_{j=1}^{n} \frac{\partial p_{ij}}{\partial r_m}\, J_j + \sum_{j=1}^{n} p_{ij}\, \frac{\partial J_j}{\partial r_m}, \qquad i = 1, \ldots, n,
where the argument r at which the partial derivatives are computed is suppressed.
Interpret the above equation as Bellman's equation for an SSP problem.
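The displayed relation is a linear system in the unknowns ∂J_i/∂r_m, which can be solved as an SSP policy evaluation with modified stage costs. A sketch for hypothetical parameterized data, checked against a finite difference of the weighted cost:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
Q0 = rng.random((n, n)); Q0 /= Q0.sum(axis=1, keepdims=True)
g0 = rng.random(n)

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def Q(r):   # substochastic transition matrix among nonterminal states (proper policy)
    return 0.8 * sigmoid(r) * Q0

def g(r):   # expected one-stage costs
    return g0 * (1.0 + r ** 2)

def J(r):   # policy costs: J = g + Q J
    return np.linalg.solve(np.eye(n) - Q(r), g(r))

r = 0.3
dQ = 0.8 * sigmoid(r) * (1 - sigmoid(r)) * Q0   # analytic derivatives of the data
dg = 2.0 * r * g0

# Solve the linear system of the exercise:  dJ = dg + dQ @ J + Q @ dJ
dJ = np.linalg.solve(np.eye(n) - Q(r), dg + dQ @ J(r))

# Finite-difference check of the gradient of the weighted cost sum_i q(i) J_i(r).
q = np.full(n, 1.0 / n)
h = 1e-6
fd = (q @ J(r + h) - q @ J(r - h)) / (2 * h)
print("analytic:", q @ dJ, " finite difference:", fd)
```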
References
[ABB01] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2001. Learning
Algorithms for Markov Decision Processes with Average Cost, SIAM J.
on Control and Optimization, Vol. 40, pp. 681-698.
[ABB02] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2002. Stochas-
tic Approximation for Non-Expansive Maps: Q-Learning Algorithms, SI-
AM J. on Control and Optimization, Vol. 41, pp. 1-22.
[ABJ06] Ahamed, T. P. I., Borkar, V. S., and Juneja, S., 2006. Adap-
tive Importance Sampling Technique for Markov Chains Using Stochastic
Approximation, Operations Research, Vol. 54, pp. 489-504.
[ASM08] Antos, A., Szepesvari, C., and Munos, R., 2008. Learning Near-
Optimal Policies with Bellman-Residual Minimization Based Fitted Policy
Iteration and a Single Sample Path, Machine Learning, Vol. 71, pp. 89-129.
[AbB02] Aberdeen, D., and Baxter, J., 2002. Scalable Internal-State Policy-
Gradient Methods for POMDPs, Proc. of the Nineteenth International
Conference on Machine Learning, pp. 3-10.
[Ama98] Amari, S., 1998. Natural Gradient Works Efficiently in Learn-
ing, Neural Computation, Vol. 10, pp. 251-276.
[Att03] Attias, H. 2003. Planning by Probabilistic Inference, in C. M.
Bishop and B. J. Frey, (Eds.), Proc. of the 9th Int. Workshop on Artificial
Intelligence and Statistics.
[BBN04] Bertsekas, D. P., Borkar, V., and Nedic, A., 2004. Improved
Temporal Difference Methods with Linear Function Approximation, in
Learning and Approximate Dynamic Programming, by J. Si, A. Barto, W.
Powell, (Eds.), IEEE Press, N. Y.
[BBS95] Barto, A. G., Bradtke, S. J., and Singh, S. P., 1995. Real-
Time Learning and Control Using Asynchronous Dynamic Programming,
Artificial Intelligence, Vol. 72, pp. 81-138.
[BBD10] Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D., 2010.
Reinforcement Learning and Dynamic Programming Using Function Ap-
proximators, CRC Press, N. Y.
[BED09] Busoniu, L., Ernst, D., De Schutter, B., and Babuska, R., 2009.
Online Least-Squares Policy Iteration for Reinforcement Learning Con-
trol, unpublished report, Delft Univ. of Technology, Delft, NL.
[BHO08] Bethke, B., How, J. P., and Ozdaglar, A., 2008. Approximate Dy-
namic Programming Using Support Vector Regression, Proc. IEEE Con-
ference on Decision and Control, Cancun, Mexico.
[BKM05] de Boer, P. T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y.
2005. A Tutorial on the Cross-Entropy Method, Annals of Operations
Research, Vol. 134, pp. 19-67.
[BMP90] Benveniste, A., Metivier, M., and Priouret, P., 1990. Adaptive
Algorithms and Stochastic Approximations, Springer-Verlag, N. Y.
[BSA83] Barto, A. G., Sutton, R. S., and Anderson, C. W., 1983. Neuron-
like Elements that Can Solve Difficult Learning Control Problems, IEEE
Trans. on Systems, Man, and Cybernetics, Vol. 13, pp. 835-846.
[BaB01] Baxter, J., and Bartlett, P. L., 2001. Infinite-Horizon Policy-
Gradient Estimation, J. Artificial Intelligence Research, Vol. 15, pp. 319-
350.
[Bai93] Baird, L. C., 1993. Advantage Updating, Report WL-TR-93-
1146, Wright Patterson AFB, OH.
[Bai94] Baird, L. C., 1994. Reinforcement Learning in Continuous Time:
Advantage Updating, International Conf. on Neural Networks, Orlando,
Fla.
[Bai95] Baird, L. C., 1995. Residual Algorithms: Reinforcement Learning
with Function Approximation, Dept. of Computer Science Report, U.S.
Air Force Academy, CO.
[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. Temporal Differences-Based
Policy Iteration and Applications in Neuro-Dynamic Programming, Lab.
for Info. and Decision Systems Report LIDS-P-2349, Massachusetts Insti-
tute of Technology.
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Dis-
tributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs,
N. J.; republished by Athena Scientific, Belmont, MA, 1997.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Pro-
gramming, Athena Scientific, Belmont, MA.
[BeT00] Bertsekas, D. P., and Tsitsiklis, J. N., 2000. Gradient Convergence
in Gradient Methods, SIAM J. on Optimization, Vol. 10, pp. 627-642.
[BeY07] Bertsekas, D. P., and Yu, H., 2007. Solution of Large Systems
of Equations Using Approximate Dynamic Programming Methods, LIDS
Report 2754, MIT.
[BeY09] Bertsekas, D. P., and Yu, H., 2009. Projected Equation Methods
for Approximate Solution of Large Linear Systems, Journal of Computa-
tional and Applied Mathematics, Vol. 227, pp. 27-50.
[BeY10a] Bertsekas, D. P., and Yu, H., 2010. Q-Learning and Enhanced
Policy Iteration in Discounted Dynamic Programming, Lab. for Informa-
tion and Decision Systems Report LIDS-P-2831, MIT.
[BeY10b] Bertsekas, D. P., and Yu, H., 2010. Asynchronous Distributed
Policy Iteration in Dynamic Programming, Proc. of Allerton Conf. on
Information Sciences and Systems.
[Ber77] Bertsekas, D. P., 1977. Monotone Mappings with Application in
Dynamic Programming, SIAM J. on Control and Optimization, Vol. 15,
pp. 438-464.
[Ber82] Bertsekas, D. P., 1982. Distributed Dynamic Programming, IEEE
Trans. Automatic Control, Vol. AC-27, pp. 610-616.
[Ber83] Bertsekas, D. P., 1983. Asynchronous Distributed Computation of
Fixed Points, Math. Programming, Vol. 27, pp. 107-120.
[Ber95] Bertsekas, D. P., 1995. A Counterexample to Temporal Differences
Learning, Neural Computation, Vol. 7, pp. 270-279.
[Ber96] Bertsekas, D. P., 1996. Lecture at NSF Workshop on Reinforcement
Learning, Hilltop House, Harpers Ferry, N.Y.
[Ber97] Bertsekas, D. P., 1997. Differential Training of Rollout Policies,
Proc. of the 35th Allerton Conference on Communication, Control, and
Computing, Allerton Park, Ill.
[Ber99] Bertsekas, D. P., 1999. Nonlinear Programming: 2nd Edition, Athe-
na Scientific, Belmont, MA.
[Ber05a] Bertsekas, D. P., 2005. Dynamic Programming and Suboptimal
Control: A Survey from ADP to MPC, in Fundamental Issues in Control,
European J. of Control, Vol. 11.
[Ber05b] Bertsekas, D. P., 2005. Rollout Algorithms for Constrained Dy-
namic Programming, Lab. for Information and Decision Systems Report
2646, MIT.
[Ber09a] Bertsekas, D. P., 2009. Convex Optimization Theory, Athena Sci-
entific, Belmont, MA.
[Ber09b] Bertsekas, D. P., 2009. Projected Equations, Variational Inequal-
ities, and Temporal Dierence Methods, Lab. for Information and Decision
Systems Report LIDS-P-2808, MIT.
[Ber10a] Bertsekas, D. P., 2010. Rollout Algorithms for Discrete Opti-
mization: A Survey, Lab. for Information and Decision Systems Report,
MIT; to appear in Handbook of Combinatorial Optimization, by D.-Z. Du,
and P. Pardalos (eds.), Springer, N. Y.
[Ber10b] Bertsekas, D. P., 2010. Approximate Policy Iteration: A Survey
and Some New Methods, Lab. for Information and Decision Systems Re-
port LIDS-P-2833, MIT; J. of Control Theory and Applications, Vol. 9, pp.
310-335.
[Ber10c] Bertsekas, D. P., 2010. Pathologies of Temporal Difference Meth-
ods in Approximate Dynamic Programming, Proc. 2010 IEEE Conference
on Decision and Control.
[Ber10d] Bertsekas, D. P., 2010. Incremental Gradient, Subgradient, and
Proximal Methods for Convex Optimization: A Survey, Lab. for Informa-
tion and Decision Systems Report LIDS-P-2848, MIT.
[Ber11a] Bertsekas, D. P., 2011. Temporal Difference Methods for General
Projected Equations, IEEE Trans. on Automatic Control, Vol. 56, (to
appear).
[Ber11b] Bertsekas, D. P., 2011. λ-Policy Iteration: A Review and a New
Implementation, Lab. for Information and Decision Systems Report LIDS-
P-2874, MIT; to appear in Reinforcement Learning and Approximate Dy-
namic Programming for Feedback Control , by F. Lewis and D. Liu (eds.),
IEEE Press Computational Intelligence Series.
[BoM00] Borkar, V. S., and Meyn, S. P., 2000. The O.D.E. Method for
Convergence of Stochastic Approximation and Reinforcement Learning,
SIAM J. Control and Optimization, Vol. 38, pp. 447-469.
[Bor08] Borkar, V. S., 2008. Stochastic Approximation: A Dynamical Sys-
tems Viewpoint, Cambridge Univ. Press, N. Y.
[Bor09] Borkar, V. S., 2009. Reinforcement Learning A Bridge Between
Numerical Methods and Monte Carlo, in World Scientic Review Vol. 9,
Chapter 4.
[Boy02] Boyan, J. A., 2002. Technical Update: Least-Squares Temporal
Difference Learning, Machine Learning, Vol. 49, pp. 1-15.
[BrB96] Bradtke, S. J., and Barto, A. G., 1996. Linear Least-Squares
Algorithms for Temporal Difference Learning, Machine Learning, Vol. 22,
pp. 33-57.
[Bur97] Burgiel, H., 1997. How to Lose at Tetris, The Mathematical
Gazette, Vol. 81, pp. 194-200.
[CFH07] Chang, H. S., Fu, M. C., Hu, J., Marcus, S. I., 2007. Simulation-
Based Algorithms for Markov Decision Processes, Springer, N. Y.
[CPS92] Cottle, R. W., Pang, J-S., and Stone, R. E., 1992. The Linear
Complementarity Problem, Academic Press, N. Y.; republished by SIAM
in 2009.
[CaC97] Cao, X. R., and Chen, H. F., 1997. Perturbation Realization Po-
tentials and Sensitivity Analysis of Markov Processes, IEEE Transactions
on Automatic Control, Vol. 32, pp. 1382-1393.
[CaW98] Cao, X. R., and Wan, Y. W., 1998. Algorithms for Sensitivity
Analysis of Markov Systems Through Potentials and Perturbation Realiza-
tion, IEEE Transactions Control Systems Technology, Vol. 6, pp. 482-494.
[Cao99] Cao, X. R., 1999. Single Sample Path Based Optimization of
Markov Chains, J. of Optimization Theory and Applications, Vol. 100,
pp. 527-548.
[Cao04] Cao, X. R., 2004. Learning and Optimization from a System The-
oretic Perspective, in Learning and Approximate Dynamic Programming,
by J. Si, A. Barto, W. Powell, (Eds.), IEEE Press, N. Y.
[Cao05] Cao, X. R., 2005. A Basic Formula for Online Policy Gradient
Algorithms, IEEE Transactions on Automatic Control, Vol. 50, pp. 696-
699.
[Cao07] Cao, X. R., 2007. Stochastic Learning and Optimization: A Sensiti-
vity-Based Approach, Springer, N. Y.
[ChV06] Choi, D. S., and Van Roy, B., 2006. A Generalized Kalman Filter
for Fixed Point Approximation and Efficient Temporal-Difference Learn-
ing, Discrete Event Dynamic Systems, Vol. 16, pp. 207-239.
[DFM09] Desai, V. V., Farias, V. F., and Moallemi, C. C., 2009. Approx-
imate Dynamic Programming via a Smoothed Approximate Linear Pro-
gram, Submitted.
[DFV00] de Farias, D. P., and Van Roy, B., 2000. On the Existence of
Fixed Points for Approximate Value Iteration and Temporal-Difference
Learning, J. of Optimization Theory and Applications, Vol. 105.
[DFV03] de Farias, D. P., and Van Roy, B., 2003. The Linear Programming
Approach to Approximate Dynamic Programming, Operations Research,
Vol. 51, pp. 850-865.
[DFV04a] de Farias, D. P., and Van Roy, B., 2004. On Constraint Sam-
pling in the Linear Programming Approach to Approximate Dynamic Pro-
gramming, Mathematics of Operations Research, Vol. 29, pp. 462-478.
[Day92] Dayan, P., 1992. The Convergence of TD(λ) for General λ, Ma-
chine Learning, Vol. 8, pp. 341-362.
[DeF04] De Farias, D. P., 2004. The Linear Programming Approach to
Approximate Dynamic Programming, in Learning and Approximate Dy-
namic Programming, by J. Si, A. Barto, W. Powell, (Eds.), IEEE Press,
N. Y.
[Den67] Denardo, E. V., 1967. Contraction Mappings in the Theory Un-
derlying Dynamic Programming, SIAM Review, Vol. 9, pp. 165-177.
[EGW06] Ernst, D., Geurts, P., and Wehenkel, L., 2006. Tree-Based Batch
Mode Reinforcement Learning, Journal of Machine Learning Research,
Vol. 6, pp. 503-556.
[ELP09] Moazzez-Estanjini, R., Li, K., and Paschalidis, I. C., 2009. An
Actor-Critic Method Using Least Squares Temporal Dierence Learning
with an Application to Warehouse Management, Proc. of the 48th IEEE
Conference on Decision and Control, Shanghai, China, pp. 2564-2569.
[FaV06] Farias, V. F., and Van Roy, B., 2006. Tetris: A Study of Ran-
domized Constraint Sampling, in Probabilistic and Randomized Methods
for Design Under Uncertainty, Springer-Verlag.
[FeS94] Feinberg, E. A., and Shwartz, A., 1994. Markov Decision Models
with Weighted Discounted Criteria, Mathematics of Operations Research,
Vol. 19, pp. 1-17.
[FeS04] Ferrari, S., and Stengel, R. F., 2004. Model-Based Adaptive Critic
Designs, in Learning and Approximate Dynamic Programming, by J. Si,
A. Barto, W. Powell, (Eds.), IEEE Press, N. Y.
[Fle84] Fletcher, C. A. J., 1984. Computational Galerkin Methods, Springer-
Verlag, N. Y.
[FuH94] Fu, M. C., and Hu, J.-Q., 1994. Smoothed Perturbation Analysis Deriva-
tive Estimation for Markov Chains, Oper. Res. Letters, Vol. 41, pp. 241-
251.
[GKP03] Guestrin, C. E., Koller, D., Parr, R., and Venkataraman, S.,
2003. Efficient Solution Algorithms for Factored MDPs, J. of Artificial
Intelligence Research, Vol. 19, pp. 399-468.
[GLH94] Gurvits, L., Lin, L. J., and Hanson, S. J., 1994. Incremen-
tal Learning of Evaluation Functions for Absorbing Markov Chains: New
Methods and Theorems, Preprint.
[GlI89] Glynn, P. W., and Iglehart, D. L., 1989. Importance Sampling for
Stochastic Simulations, Management Science, Vol. 35, pp. 1367-1392.
[Gly87] Glynn, P. W., 1987. Likelihood Ratio Gradient Estimation: An
Overview, Proc. of the 1987 Winter Simulation Conference, pp. 366-375.
[Gor95] Gordon, G. J., 1995. Stable Function Approximation in Dynamic
Programming, in Machine Learning: Proceedings of the Twelfth Interna-
tional Conference, Morgan Kaufmann, San Francisco, CA.
[Gos03] Gosavi, A., 2003. Simulation-Based Optimization: Parametric Op-
timization Techniques and Reinforcement Learning, Springer-Verlag, N. Y.
[Gos04] Gosavi, A., 2004. Reinforcement Learning for Long-Run Average
Cost, European J. of Operational Research, Vol. 155, pp. 654-674.
[GrU04] Grudic, G., and Ungar, L., 2004. Reinforcement Learning in
Large, High-Dimensional State Spaces, in Learning and Approximate Dy-
namic Programming, by J. Si, A. Barto, W. Powell, (Eds.), IEEE Press,
N. Y.
[HBK94] Harmon, M. E., Baird, L. C., and Klopf, A. H., 1994. Advan-
tage Updating Applied to a Differential Game, Presented at NIPS Conf.,
Denver, Colo.
[HFM05] He, Y., Fu, M. C., and Marcus, S. I., 2005. A Two-Timescale
Simulation-Based Gradient Algorithm for Weighted Cost Markov Decision
Processes, Proc. of the 2005 Conf. on Decision and Control, Seville, Spain,
pp. 8022-8027.
[Hau00] Hauskrecht, M., 2000. Value-Function Approximations for Par-
tially Observable Markov Decision Processes, Journal of Artificial Intelli-
gence Research, Vol. 13, pp. 33-95.
[Hay08] Haykin, S., 2008. Neural Networks and Learning Machines (3rd
Edition), Prentice-Hall, Englewood Cliffs, N. J.
[He02] He, Y., 2002. Simulation-Based Algorithms for Markov Decision
Processes, Ph.D. Thesis, University of Maryland.
[JJS94] Jaakkola, T., Jordan, M. I., and Singh, S. P., 1994. On the
Convergence of Stochastic Iterative Dynamic Programming Algorithms,
Neural Computation, Vol. 6, pp. 1185-1201.
[JSJ95] Jaakkola, T., Singh, S. P., and Jordan, M. I., 1995. Reinforcement
Learning Algorithm for Partially Observable Markov Decision Problems,
Advances in Neural Information Processing Systems, Vol. 7, pp. 345-352.
[JuP07] Jung, T., and Polani, D., 2007. Kernelizing LSPE(λ), in Proc.
2007 IEEE Symposium on Approximate Dynamic Programming and Rein-
forcement Learning, Honolulu, Hawaii, pp. 338-345.
[KMP06] Keller, P. W., Mannor, S., and Precup, D., 2006. Automatic
Basis Function Construction for Approximate Dynamic Programming and
Reinforcement Learning, Proc. of the 23rd ICML, Pittsburgh, Penn.
[Kak02] Kakade, S., 2002. A Natural Policy Gradient, Proc. Advances
in Neural Information Processing Systems, Vancouver, BC, Vol. 14, pp.
1531-1538.
[KoB99] Konda, V. R., and Borkar, V. S., 1999. Actor-Critic Like Learn-
ing Algorithms for Markov Decision Processes, SIAM J. on Control and
Optimization, Vol. 38, pp. 94-123.
[KoP00] Koller, D., and Parr, R., 2000. Policy Iteration for Factored
MDPs, Proc. of the 16th Annual Conference on Uncertainty in AI, pp.
326-334.
[KoT99] Konda, V. R., and Tsitsiklis, J. N., 1999. Actor-Critic Algo-
rithms, Proc. 1999 Neural Information Processing Systems Conference,
Denver, Colorado, pp. 1008-1014.
[KoT03] Konda, V. R., and Tsitsiklis, J. N., 2003. Actor-Critic Algo-
rithms, SIAM J. on Control and Optimization, Vol. 42, pp. 1143-1166.
[Kon02] Konda, V. R., 2002. Actor-Critic Algorithms, Ph.D. Thesis, Dept.
of EECS, M.I.T., Cambridge, MA.
[KuY03] Kushner, H. J., and Yin, G. G., 2003. Stochastic Approximation
and Recursive Algorithms and Applications, 2nd Edition, Springer-Verlag,
N. Y.
[Kra72] Krasnoselskii, M. A., et al., 1972. Approximate Solution of Opera-
tor Equations, Translated by D. Louvish, Wolters-Noordhoff Pub., Gronin-
gen.
[LLL08] Lewis, F. L., Liu, D., and Lendaris, G. G., 2008. Special Issue on
Adaptive Dynamic Programming and Reinforcement Learning in Feedback
Control, IEEE Transactions on Systems, Man, and Cybernetics, Part B,
Vol. 38, No. 4.
[LSS09] Li, Y., Szepesvari, C., and Schuurmans, D., 2009. Learning Ex-
ercise Policies for American Options, Proc. of the Twelfth International
Conference on Artificial Intelligence and Statistics, Clearwater Beach, Fla.
[LaP03] Lagoudakis, M. G., and Parr, R., 2003. Least-Squares Policy
Iteration, J. of Machine Learning Research, Vol. 4, pp. 1107-1149.
[LeV09] Lewis, F. L., and Vrabie, D., 2009. Reinforcement Learning and
Adaptive Dynamic Programming for Feedback Control, IEEE Circuits
and Systems Magazine, 3rd Q. Issue.
[Liu01] Liu, J. S., 2001. Monte Carlo Strategies in Scientic Computing,
Springer, N. Y.
[LoS01] Longstaff, F. A., and Schwartz, E. S., 2001. Valuing American
Options by Simulation: A Simple Least-Squares Approach, Review of
Financial Studies, Vol. 14, pp. 113-147.
[MMS06] Menache, I., Mannor, S., and Shimkin, N., 2005. Basis Function
Adaptation in Temporal Difference Reinforcement Learning, Ann. Oper.
Res., Vol. 134, pp. 215-238.
[MaT01] Marbach, P., and Tsitsiklis, J. N., 2001. Simulation-Based Opti-
mization of Markov Reward Processes, IEEE Transactions on Automatic
Control, Vol. 46, pp. 191-209.
[MaT03] Marbach, P., and Tsitsiklis, J. N., 2003. Approximate Gradient
Methods in Policy-Space Optimization of Markov Reward Processes, J.
Discrete Event Dynamic Systems, Vol. 13, pp. 111-148.
[Mah96] Mahadevan, S., 1996. Average Reward Reinforcement Learning:
Foundations, Algorithms, and Empirical Results, Machine Learning, Vol.
22, pp. 1-38.
[Mar70] Martinet, B., 1970. Regularisation d'Inequations Variationnelles
par Approximations Successives, Rev. Francaise Inf. Rech. Oper., Vol. 2,
pp. 154-159.
[Mey07] Meyn, S., 2007. Control Techniques for Complex Networks, Cam-
bridge University Press, N. Y.
[MuS08] Munos, R., and Szepesvari, C., 2008. Finite-Time Bounds for
Fitted Value Iteration, Journal of Machine Learning Research, Vol. 9, pp.
815-857.
[Mun03] Munos, R., 2003. Error Bounds for Approximate Policy Itera-
tion, Proc. 20th International Conference on Machine Learning, pp. 560-
567.
[NeB03] Nedic, A., and Bertsekas, D. P., 2003. Least-Squares Policy Eval-
uation Algorithms with Linear Function Approximation, J. of Discrete
Event Systems, Vol. 13, pp. 79-110.
[OrS02] Ormoneit, D., and Sen, S., 2002. Kernel-Based Reinforcement
Learning, Machine Learning, Vol. 49, pp. 161-178.
[PSD01] Precup, D., Sutton, R. S., and Dasgupta, S., 2001. Off-Policy
Temporal-Difference Learning with Function Approximation, In Proc. 18th
Int. Conf. Machine Learning, pp. 417-424.
[PWB09] Polydorides, N., Wang, M., and Bertsekas, D. P., 2009. Approx-
imate Solution of Large-Scale Linear Inverse Problems with Monte Carlo
Simulation, Lab. for Information and Decision Systems Report LIDS-P-
2822, MIT.
[Pin97] Pineda, F., 1997. Mean-Field Analysis for Batched TD(λ), Neural
Computation, pp. 1403-1419.
[PoB04] Poupart, P., and Boutilier, C., 2004. Bounded Finite State Con-
trollers, Advances in Neural Information Processing Systems.
[PoV04] Powell, W. B., and Van Roy, B., 2004. Approximate Dynamic
Programming for High-Dimensional Resource Allocation Problems, in Le-
arning and Approximate Dynamic Programming, by J. Si, A. Barto, W.
Powell, (Eds.), IEEE Press, N. Y.
[Pow07] Powell, W. B., 2007. Approximate Dynamic Programming: Solving
the Curses of Dimensionality, J. Wiley and Sons, Hoboken, N. J.
[Roc76] Rockafellar, R. T., 1976. Monotone Operators and the Proximal
Point Algorithm, SIAM J. on Control and Optimization, Vol. 14, pp. 877-
898.
[RuK04] Rubinstein, R. Y., and Kroese, D. P., 2004. The Cross-Entropy
Method: A Unified Approach to Combinatorial Optimization, Springer, N.
Y.
[RuK08] Rubinstein, R. Y., and Kroese, D. P., 2008. Simulation and the
Monte Carlo Method, 2nd Edition, J. Wiley, N. Y.
[SBP04] Si, J., Barto, A., Powell, W., and Wunsch, D., (Eds.) 2004. Learn-
ing and Approximate Dynamic Programming, IEEE Press, N. Y.
[SDG09] Simao, D. G., Day, S., George, A. P., Gifford, T., Nienow, J., and
Powell, W. B., 2009. An Approximate Dynamic Programming Algorithm
for Large-Scale Fleet Management: A Case Application, Transportation
Science, Vol. 43, pp. 178-197.
[SJJ94] Singh, S. P., Jaakkola, T., and Jordan, M. I., 1994. Learning
without State-Estimation in Partially Observable Markovian Decision Pro-
cesses, Proceedings of the Eleventh Machine Learning Conference, pp.
284-292.
[SJJ95] Singh, S. P., Jaakkola, T., and Jordan, M. I., 1995. Reinforcement
Learning with Soft State Aggregation, in Advances in Neural Information
Processing Systems 7, MIT Press, Cambridge, MA.
[SMS99] Sutton, R. S., McAllester, D., Singh, S. P., and Mansour, Y.,
1999. Policy Gradient Methods for Reinforcement Learning with Func-
tion Approximation, Proc. 1999 Neural Information Processing Systems
Conference, Denver, Colorado.
[SYL04] Si, J., Yang, L., and Liu, D., 2004. Direct Neural Dynamic Pro-
gramming, in Learning and Approximate Dynamic Programming, by J.
Si, A. Barto, W. Powell, (Eds.), IEEE Press, N. Y.
[Saa03] Saad, Y., 2003. Iterative Methods for Sparse Linear Systems, SIAM,
Phila., Pa.
[Sam59] Samuel, A. L., 1959. Some Studies in Machine Learning Using
the Game of Checkers, IBM Journal of Research and Development, pp.
210-229.
[Sam67] Samuel, A. L., 1967. Some Studies in Machine Learning Using
the Game of Checkers. II - Recent Progress, IBM Journal of Research
and Development, pp. 601-617.
[ScS85] Schweitzer, P. J., and Seidman, A., 1985. Generalized Polyno-
mial Approximations in Markovian Decision Problems, J. Math. Anal.
and Appl., Vol. 110, pp. 568-582.
[Sch93] Schwartz, A., 1993. A Reinforcement Learning Method for Maxi-
mizing Undiscounted Rewards, Proceedings of the Tenth Machine Learn-
ing Conference, pp. 298-305.
[Sch07] Scherrer, B., 2007. Performance Bounds for Lambda Policy Itera-
tion, Technical Report 6348, INRIA.
[Sch10] Scherrer, B., 2010. Should One Compute the Temporal Difference
Fix Point or Minimize the Bellman Residual? The Unified Oblique Projec-
tion View, in ICML'10: Proc. of the 27th Annual International Conf. on
Machine Learning.
[Sha53] Shapley, L. S., 1953. Stochastic Games, Proc. Nat. Acad. Sci.
U.S.A., Vol. 39.
[Sin94] Singh, S. P., 1994. Reinforcement Learning Algorithms for Average-
Payoff Markovian Decision Processes, Proc. of 12th National Conference
on Artificial Intelligence, pp. 202-207.
[Str09] Strang, G., 2009. Linear Algebra and its Applications, Wellesley
Cambridge Press, Wellesley, MA.
[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning,
MIT Press, Cambridge, MA.
[Sut88] Sutton, R. S., 1988. Learning to Predict by the Methods of Tem-
poral Differences, Machine Learning, Vol. 3, pp. 9-44.
[SzL06] Szita, I., and Lorincz, A., 2006. Learning Tetris Using the Noisy
Cross-Entropy Method, Neural Computation, Vol. 18, pp. 2936-2941.
[SzS04] Szepesvari, C., and Smart, W. D., 2004. Interpolation-Based Q-
Learning, Proc. of 21st International Conf. on Machine Learning, Banff,
Ca.
[Sze09] Szepesvari, C., 2009. Reinforcement Learning Algorithms for MDPs,
Dept. of Computing Science Report TR09-13, University of Alberta, Ca.
[Tes92] Tesauro, G., 1992. Practical Issues in Temporal Difference Learn-
ing, Machine Learning, Vol. 8, pp. 257-277.
[ThS09] Thiery, C., and Scherrer, B., 2009. Improvements on Learning
Tetris with Cross-Entropy, International Computer Games Association
Journal, Vol. 32, pp. 23-33.
[ThS10a] Thiery, C., and Scherrer, B., 2010. Least-Squares λ-Policy Iter-
ation: Bias-Variance Trade-off in Control Problems, in ICML'10: Proc. of
the 27th Annual International Conf. on Machine Learning.
[ThS10b] Thiery, C., and Scherrer, B., 2010. Performance Bound for Ap-
proximate Optimistic Policy Iteration, Technical Report, INRIA.
[ToS06] Toussaint, M., and Storkey, A. 2006. Probabilistic Inference for
Solving Discrete and Continuous State Markov Decision Processes, in
Proc. of the 23rd ICML, pp. 945-952.
[TrB97] Trefethen, L. N., and Bau, D., 1997. Numerical Linear Algebra,
SIAM, Phila., PA.
[TsV96] Tsitsiklis, J. N., and Van Roy, B., 1996. Feature-Based Methods
for Large-Scale Dynamic Programming, Machine Learning, Vol. 22, pp.
59-94.
[TsV97] Tsitsiklis, J. N., and Van Roy, B., 1997. An Analysis of Temporal-
Difference Learning with Function Approximation, IEEE Transactions on
Automatic Control, Vol. 42, pp. 674-690.
[TsV99a] Tsitsiklis, J. N., and Van Roy, B., 1999. Average Cost Temporal-
Difference Learning, Automatica, Vol. 35, pp. 1799-1808.
[TsV99b] Tsitsiklis, J. N., and Van Roy, B., 1999. Optimal Stopping of
Markov Processes: Hilbert Space Theory, Approximation Algorithms, and
an Application to Pricing Financial Derivatives, IEEE Transactions on
Automatic Control, Vol. 44, pp. 1840-1851.
[TsV01] Tsitsiklis, J. N., and Van Roy, B., 2001. Regression Methods for
Pricing Complex American-Style Options, IEEE Trans. on Neural Net-
works, Vol. 12, pp. 694-703.
[TsV02] Tsitsiklis, J. N., and Van Roy, B., 2002. On Average Versus Dis-
counted Reward Temporal-Difference Learning, Machine Learning, Vol.
49, pp. 179-191.
[Tsi94] Tsitsiklis, J. N., 1994. Asynchronous Stochastic Approximation
and Q-Learning, Machine Learning, Vol. 16, pp. 185-202.
[VBL07] Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N.,
1997. A Neuro-Dynamic Programming Approach to Retailer Inventory
Management, Proc. of the IEEE Conference on Decision and Control;
based on a more extended Lab. for Information and Decision Systems
Report, MIT, Nov. 1996.
[Van95] Van Roy, B., 1995. Feature-Based Methods for Large Scale Dy-
namic Programming, Lab. for Info. and Decision Systems Report LIDS-
TH-2289, Massachusetts Institute of Technology, Cambridge, MA.
[Van98] Van Roy, B., 1998. Learning and Value Function Approximation
in Complex Decision Processes, Ph.D. Thesis, Dept. of EECS, MIT, Cam-
bridge, MA.
[Van06] Van Roy, B., 2006. Performance Loss Bounds for Approximate
Value Iteration with State Aggregation, Mathematics of Operations Re-
search, Vol. 31, pp. 234-244.
[VeR06] Verma, R., and Rao, R. P. N., 2006. Planning and Acting in Un-
certain Environments Using Probabilistic Inference, in Proc. of IEEE/RSJ
Intern. Conf. on Intelligent Robots and Systems.
[WPB09] Wang, M., Polydorides, N., and Bertsekas, D. P., 2009. Approx-
imate Simulation-Based Solution of Large-Scale Least Squares Problems,
Lab. for Information and Decision Systems Report LIDS-P-2819, MIT.
[WaB92] Watkins, C. J. C. H., and Dayan, P., 1992. Q-Learning, Machine
Learning, Vol. 8, pp. 279-292.
[Wat89] Watkins, C. J. C. H., 1989. Learning from Delayed Rewards, Ph.D. The-
sis, Cambridge Univ., England.
[WeB99] Weaver, L., and Baxter, J., 1999. Reinforcement Learning From
State and Temporal Differences, Tech. Report, Department of Computer
Science, Australian National University.
[WiB93] Williams, R. J., and Baird, L. C., 1993. Analysis of Some In-
cremental Variants of Policy Iteration: First Steps Toward Understanding
Actor-Critic Learning Systems, Report NU-CCS-93-11, College of Com-
puter Science, Northeastern University, Boston, MA.
(See http://web.mit.edu/dimitrib/www/Williams-Baird-Counterexample.pdf
for a description of this example in a format that is adapted to our context
in this chapter.)
[Wil92] Williams, R. J., 1992. Simple Statistical Gradient Following Algo-
rithms for Connectionist Reinforcement Learning, Machine Learning, Vol.
8, pp. 229-256.
[YaL08] Yao, H., and Liu, Z.-Q., 2008. Preconditioned Temporal Differ-
ence Learning, Proc. of the 25th ICML, Helsinki, Finland.
[YuB04] Yu, H., and Bertsekas, D. P., 2004. Discretized Approximations
for POMDP with Average Cost, Proc. of the 20th Conference on Uncer-
tainty in Artificial Intelligence, Banff, Canada.
[YuB06a] Yu, H., and Bertsekas, D. P., 2006. On Near-Optimality of the
Set of Finite-State Controllers for Average Cost POMDP, Lab. for Infor-
mation and Decision Systems Report 2689, MIT; Mathematics of Opera-
tions Research, Vol. 33, pp. 1-11, 2008.
[YuB06b] Yu, H., and Bertsekas, D. P., 2006. Convergence Results for
Some Temporal Difference Methods Based on Least Squares, Lab. for
Information and Decision Systems Report 2697, MIT; also in IEEE Trans-
actions on Aut. Control, Vol. 54, 2009, pp. 1515-1531.
[YuB07] Yu, H., and Bertsekas, D. P., 2007. A Least Squares Q-Learning
Algorithm for Optimal Stopping Problems, Lab. for Information and Deci-
sion Systems Report 2731, MIT; also in Proc. European Control Conference
2007, Kos, Greece.
[YuB08] Yu, H., and Bertsekas, D. P., 2008. Error Bounds for Approxima-
tions from Projected Linear Equations, Lab. for Information and Decision
Systems Report LIDS-P-2797, MIT, July 2008; Mathematics of Operations
Research, Vol. 35, 2010, pp. 306-329.
[YuB09] Yu, H., and Bertsekas, D. P., 2009. Basis Function Adaptation
Methods for Cost Approximation in MDP, Proceedings of 2009 IEEE
Symposium on Approximate Dynamic Programming and Reinforcement
Learning (ADPRL 2009), Nashville, Tenn.
[YuB11] Yu, H., and Bertsekas, D. P., 2011. On Boundedness of Q-Learning
Iterates for Stochastic Shortest Path Problems, Lab. for Information and
Decision Systems Report LIDS-P-2859, MIT, March 2011.
[Yu05] Yu, H., 2005. A Function Approximation Approach to Estimation
of Policy Gradient for POMDP with Structured Policies, Proc. of the 21st
Conference on Uncertainty in Artificial Intelligence, Edinburgh, Scotland.
[Yu10a] Yu, H., 2010. Least Squares Temporal Difference Methods: An
Analysis Under General Conditions, Technical report C-2010-39, Dept.
Computer Science, Univ. of Helsinki.
[Yu10b] Yu, H., 2010. Convergence of Least Squares Temporal Difference
Methods Under General Conditions, Proc. of the 27th ICML, Haifa, Israel.
[ZFM10] Zhou, E., Fu, M. C., and Marcus, S. I., 2010. Solving Continuous-
state POMDPs via Density Projection, IEEE Trans. Automatic Control,
Vol. AC-55, pp. 1101-1116.
[ZhH01] Zhou, R., and Hansen, E. A., 2001. An Improved Grid-Based
Approximation Algorithm for POMDPs, In Int. J. Conf. Artificial Intelli-
gence.
[ZhL97] Zhang, N. L., and Liu, W., 1997. A Model Approximation Scheme
for Planning in Partially Observable Stochastic Domains, J. Artificial In-
telligence Research, Vol. 7, pp. 199-230.