
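\[
\begin{aligned}
V^\pi(s) &= E\Big\{ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0 = s \Big\} \\
&= \sum_a \pi(s,a) \Big[ R(s,a) + \gamma \sum_{s'} P(s,s',a)\, V^\pi(s') \Big]
\end{aligned}
\]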
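As a rough illustration of the recursive form of $V^\pi$ above, the sketch below repeatedly applies its right-hand side until the values stop changing. The two-state, two-action MDP, the names `P`, `R`, `PI`, `GAMMA`, and all numbers are assumptions made up for this example; they do not come from the notes.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up.
GAMMA = 0.9
P = [  # P[s][a][s2] = probability of moving from s to s2 under action a
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
R = [[1.0, 0.0], [0.0, 2.0]]   # R[s][a] = expected immediate reward
PI = [[0.6, 0.4], [0.3, 0.7]]  # PI[s][a] = probability of action a in state s


def bellman_backup(V, gamma=GAMMA):
    """Apply the right-hand side of the recursion for V^pi once."""
    n = len(P)
    return [
        sum(
            PI[s][a]
            * (R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n)))
            for a in range(len(PI[s]))
        )
        for s in range(n)
    ]


def evaluate_policy(gamma=GAMMA, tol=1e-12, max_iters=100_000):
    """Iterate the backup from V = 0 until successive iterates agree."""
    V = [0.0] * len(P)
    for _ in range(max_iters):
        V_new = bellman_backup(V, gamma)
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


V = evaluate_policy()
print(V)
```

Because the backup is a contraction for $\gamma < 1$, the iteration converges from any starting point; starting from zero is simply a convenient choice.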

\[
\begin{aligned}
d^\pi(s') &= \lim_{t\to\infty} \Pr\{ s_t = s' \mid s_0, \pi \} \qquad \text{(does not depend on } s_0\text{)} \\
&= \sum_s d^\pi(s) \sum_a \pi(s,a)\, P(s,s',a)
\end{aligned}
\]

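\[
\begin{aligned}
\rho^\pi &= \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} r_t \qquad \text{(does not depend on } s_0\text{)} \\
&= \sum_s d^\pi(s) \sum_a \pi(s,a)\, R(s,a)
\end{aligned}
\]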
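The stationary distribution and the average reward can be approximated numerically. The sketch below uses power iteration on the fixed-point equation for $d^\pi$ and then evaluates the closed form for $\rho^\pi$. The MDP, the names `P`, `R`, `PI`, and the iteration count are illustrative assumptions, not part of the notes.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up.
P = [  # P[s][a][s2] = probability of moving from s to s2 under action a
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
R = [[1.0, 0.0], [0.0, 2.0]]   # R[s][a] = expected immediate reward
PI = [[0.6, 0.4], [0.3, 0.7]]  # PI[s][a] = probability of action a in state s


def stationary_distribution(iters=200):
    """Power iteration on d(s2) = sum_s d(s) sum_a pi(s,a) P(s,s2,a)."""
    n = len(P)
    d = [1.0 / n] * n  # any starting distribution works for an ergodic chain
    for _ in range(iters):
        d = [
            sum(
                d[s] * PI[s][a] * P[s][a][s2]
                for s in range(n)
                for a in range(len(PI[s]))
            )
            for s2 in range(n)
        ]
    return d


def average_reward(d):
    """rho = sum_s d(s) sum_a pi(s,a) R(s,a)."""
    return sum(
        d[s] * PI[s][a] * R[s][a]
        for s in range(len(d))
        for a in range(len(PI[s]))
    )


d = stationary_distribution()
rho = average_reward(d)
print(d, rho)
```

For these made-up numbers the chain induced by the policy has stationary distribution $(3/7, 4/7)$, which gives $\rho^\pi = 7.4/7 \approx 1.057$.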

In trying to form an overall discounted performance measure for $\pi$, can we use $J(\pi) = \sum_s d^\pi(s) V^\pi(s)$? It turns out that we then end up with no effect of the discounting:
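\[
\begin{aligned}
J(\pi) &= \sum_s d^\pi(s)\, V^\pi(s) \\
&= \sum_s d^\pi(s) \sum_a \pi(s,a) \Big[ R(s,a) + \gamma \sum_{s'} P(s,s',a)\, V^\pi(s') \Big] \\
&= \rho^\pi + \gamma \sum_s d^\pi(s) \sum_a \pi(s,a) \sum_{s'} P(s,s',a)\, V^\pi(s') \\
&= \rho^\pi + \gamma \sum_{s'} V^\pi(s') \sum_s d^\pi(s) \sum_a \pi(s,a)\, P(s,s',a) \\
&= \rho^\pi + \gamma \sum_{s'} V^\pi(s')\, d^\pi(s') \\
&= \rho^\pi + \gamma J(\pi) \\
&= \rho^\pi + \gamma \rho^\pi + \gamma^2 J(\pi) \\
&= \rho^\pi + \gamma \rho^\pi + \gamma^2 \rho^\pi + \gamma^3 \rho^\pi + \cdots \\
&= \frac{1}{1-\gamma}\, \rho^\pi
\end{aligned}
\]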
which is just $\rho^\pi$ scaled by the constant $1/(1-\gamma)$, so the discounting has no effect on how policies compare.
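The identity $J(\pi) = \rho^\pi / (1-\gamma)$ is easy to check numerically. The sketch below works directly with a policy-induced Markov chain, with the actions already averaged out under $\pi$, and compares both sides for several discount factors. The chain `P_PI`, the rewards `R_PI`, and the iteration counts are hypothetical choices made for illustration only.

```python
# Policy-induced Markov chain (actions already averaged out under pi).
# Hypothetical numbers, chosen only for illustration.
P_PI = [[0.52, 0.48], [0.36, 0.64]]  # P_PI[s][s2] = sum_a pi(s,a) P(s,s2,a)
R_PI = [0.6, 1.4]                    # R_PI[s] = sum_a pi(s,a) R(s,a)
N = len(R_PI)


def stationary(iters=200):
    """Power iteration for the stationary distribution d."""
    d = [1.0 / N] * N
    for _ in range(iters):
        d = [sum(d[s] * P_PI[s][s2] for s in range(N)) for s2 in range(N)]
    return d


def value(gamma, iters=5000):
    """Discounted value function via repeated backups from V = 0."""
    V = [0.0] * N
    for _ in range(iters):
        V = [
            R_PI[s] + gamma * sum(P_PI[s][s2] * V[s2] for s2 in range(N))
            for s in range(N)
        ]
    return V


def check(gamma):
    """Return (J, rho / (1 - gamma)) for the given discount factor."""
    d = stationary()
    rho = sum(d[s] * R_PI[s] for s in range(N))
    V = value(gamma)
    J = sum(d[s] * V[s] for s in range(N))
    return J, rho / (1.0 - gamma)


for gamma in (0.0, 0.5, 0.9, 0.99):
    J, scaled_rho = check(gamma)
    print(f"gamma={gamma}: J={J:.6f}, rho/(1-gamma)={scaled_rho:.6f}")
```

The two printed columns should agree for every $\gamma$ up to floating-point error. Changing $\gamma$ only rescales $J(\pi)$, so two policies ordered by $\rho^\pi$ stay in the same order under $J$.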
