NEURODYNAMIC PROGRAMMING
Chapter 12
Simon Haykin
[Diagram fragments: reinforcement / self-organizing learning; modern vs. classical; planning, punishment and reward]
There must be a balance between a low present cost and a high future cost.
For any initial state X_0, the optimal cost J*(X_0) of the finite-horizon problem equals J_0(X_0), where the function J_0 is obtained from the last step of the algorithm

    J_n(X_n) = \min_{\mu_n} E_{X_{n+1}} \{ g_n(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1}) \},

which runs backward through time, starting from the terminal cost

    J_K(X_K) = g_K(X_K)        (K = horizon).

The corresponding value-iteration form, with n counting iterations forward, is

    J_{n+1}(X_0) = \min_{\mu} E_{X_1} \{ g(X_0, \mu(X_0), X_1) + J_n(X_1) \},        J_0(X) = 0 for all X.
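As a concrete illustration, the backward recursion can be sketched in a few lines of Python. The two-state, two-action chain below (transition probabilities, stage costs, terminal cost, horizon K = 3) is entirely hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Hypothetical 2-state, 2-action finite-horizon problem (K = 3).
# P[a][i][j] = transition probability p_ij(a); g[a][i][j] = stage cost.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[[1.0, 4.0], [2.0, 1.0]],
              [[3.0, 0.5], [0.0, 2.0]]])
gK = np.array([0.0, 5.0])           # terminal cost g_K(X_K)

K = 3
J = gK.copy()                       # J_K = g_K
policy = []
for n in range(K - 1, -1, -1):      # backward through time
    # E[g_n + J_{n+1}] for each action a and state i: shape (A, N)
    Qn = np.einsum('aij,aij->ai', P, g) + P @ J
    policy.insert(0, Qn.argmin(axis=0))   # minimizing action per state
    J = Qn.min(axis=0)                    # J_n(i)

print("J_0 =", J)                   # optimal cost from each initial state
```

With these numbers, `J` ends up holding J_0(i) for each initial state and `policy[n][i]` the minimizing action at stage n.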
Bellman's optimality equation for an N-state Markovian decision problem:

    J^*(i) = \min_{\mu} \Big[ c(i, \mu(i)) + \sum_{j=1}^{N} p_{ij}(\mu) \, J^*(j) \Big],

where c(i, \mu(i)) is the immediate expected cost. The expectations behind this form evaluate to

    E\{ g(i, \mu(i), X_1) \} = \sum_{j=1}^{N} p_{ij} \, g(i, \mu(i), j),        E\{ J(X_1) \} = \sum_{j=1}^{N} p_{ij} \, J(j).
Value iteration:

    J_{n+1}(i) = \min_{a} \Big[ c(i, a) + \sum_{j=1}^{N} p_{ij}(a) \, J_n(j) \Big],

iterated until the change in J falls below a tolerance parameter. The Q-factor of the state-action pair (i, a) is

    Q(i, a) = c(i, a) + \sum_{j=1}^{N} p_{ij}(a) \, J(j).
Therefore, the optimal policy is determined as the greedy policy with respect to J^*(i):

    \mu^*(i) = \arg\min_{a} Q^*(i, a).
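A minimal value-iteration sketch with greedy-policy extraction follows. The three-state costs and transition probabilities are hypothetical, and a discount factor γ = 0.9 is assumed (the undiscounted form converges only under additional conditions, e.g. proper policies):

```python
import numpy as np

# Hypothetical stationary problem: N = 3 states, A = 2 actions.
# P[a] is the transition matrix under action a; c[a][i] = c(i, a).
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
              [[0.4, 0.4, 0.2], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]]])
c = np.array([[2.0, 1.0, 3.0],
              [1.5, 2.5, 0.5]])
gamma = 0.9                         # discount factor (assumed)
tol = 1e-8                          # tolerance parameter

J = np.zeros(3)                     # J_0(i) = 0 for all i
while True:
    Q = c + gamma * (P @ J)         # Q(i, a) as an (A, N) array
    J_next = Q.min(axis=0)          # J_{n+1}(i) = min_a Q(i, a)
    if np.max(np.abs(J_next - J)) < tol:
        break
    J = J_next

mu = Q.argmin(axis=0)               # greedy policy mu*(i) = argmin_a Q(i, a)
print("J* =", J_next, "policy:", mu)
```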
[Figure: four-stage shortest-path network (stages 1-4, nodes A through J) solved by dynamic programming; the minimum cost from A to J is 11, attained by three optimal routes: A-C-E-H-J, A-D-E-H-J, and A-D-F-I-J.]
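The figure appears to be the classic four-stage "stagecoach" network from Hillier and Lieberman (cited in the bibliography). Assuming its standard edge costs, relabeled with nodes A through J, the backward recursion reproduces the figure's result:

```python
# Edge costs of the four-stage stagecoach network (assumed to be the
# standard Hillier & Lieberman values, relabeled with nodes A..J).
cost = {
    'A': {'B': 2, 'C': 4, 'D': 3},
    'B': {'E': 7, 'F': 4, 'G': 6},
    'C': {'E': 3, 'F': 2, 'G': 4},
    'D': {'E': 4, 'F': 1, 'G': 5},
    'E': {'H': 1, 'I': 4},
    'F': {'H': 6, 'I': 3},
    'G': {'H': 3, 'I': 3},
    'H': {'J': 3},
    'I': {'J': 4},
}

# Backward recursion: J(u) = min over successors v of [c(u, v) + J(v)].
J = {'J': 0}
for u in ['H', 'I', 'E', 'F', 'G', 'B', 'C', 'D', 'A']:
    J[u] = min(c + J[v] for v, c in cost[u].items())

def optimal_routes(u):
    """Enumerate every route from u that attains the optimal cost J(u)."""
    if u == 'J':
        return ['J']
    return [u + r for v, c in cost[u].items()
            if c + J[v] == J[u] for r in optimal_routes(v)]

print(J['A'], optimal_routes('A'))   # → 11 ['ACEHJ', 'ADEHJ', 'ADFIJ']
```

Under these assumed costs, the minimum cost of 11 and the three optimal routes match the figure.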
In the n-th stage, the update rule for the Q-value (based on temporal-difference methods) is defined as:

    Q_n(s, a) = Q_{n-1}(s, a) + \eta_n \big[ r_n + \gamma V_{n-1}(t_n) - Q_{n-1}(s, a) \big]    if s = s_n and a = a_n,
    Q_n(s, a) = Q_{n-1}(s, a)    otherwise,

where s_n and a_n are the current state and the selected action, respectively, t_n is the successor state, r_n is the reinforcement, and V_{n-1}(t_n) = \min_{b} Q_{n-1}(t_n, b).
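A minimal tabular sketch of this update rule follows. The environment (a short chain with unit step costs, a zero-cost goal state, and a slip probability), the learning-rate schedule, and γ = 0.9 are all hypothetical:

```python
import random

random.seed(0)
N_STATES, ACTIONS, GAMMA = 4, [0, 1], 0.9

def step(s, a):
    """Hypothetical environment: action 1 moves right, 0 moves left,
    with a 10% chance of slipping; cost 1 per step, 0 at the goal."""
    move = a if random.random() < 0.9 else 1 - a
    t = max(0, min(N_STATES - 1, s + (1 if move == 1 else -1)))
    r = 0.0 if t == N_STATES - 1 else 1.0   # cost-style reinforcement
    return t, r

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for n in range(1, 20001):
    s = random.randrange(N_STATES)
    a = random.choice(ACTIONS)              # stochastic action selection
    t, r = step(s, a)
    V = min(Q[(t, b)] for b in ACTIONS)     # V_{n-1}(t_n) = min_b Q(t_n, b)
    eta = 1.0 / (1.0 + n / 1000.0)          # decaying learning rate eta_n
    Q[(s, a)] += eta * (r + GAMMA * V - Q[(s, a)])   # TD update

greedy = {s: min(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(greedy)
```

Because costs are minimized, the greedy policy min_a Q(s, a) should learn to move right, toward the zero-cost goal state.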
Q-net training loop:
1. Select an action a_n by a stochastic procedure.
2. Observe the successor state t_n and the reinforcement r_n.
3. Adjust the Q-net by backpropagating the one-step error

    e_n(s, a) = r_n + \gamma \min_{u \in U} Q_{n-1}(t_n, u) - Q_{n-1}(s, a)    if s = s_n and a = a_n,
    e_n(s, a) = 0    otherwise.
[Figure: Q-net architecture — a network with the state and action as inputs and the corresponding Q-value as output.]
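A minimal function-approximation version of this loop replaces the multilayer Q-net with a linear approximator over one-hot (state, action) features, so "backpropagating the one-step error" reduces to a single gradient step. The chain environment and all constants below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9

def phi(s, a):
    """One-hot feature vector for the (state, action) pair."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def step(s, a):
    """Hypothetical chain: action 1 moves right, 0 moves left;
    cost 1 per step, 0 once the goal state N_STATES-1 is reached."""
    t = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return t, (0.0 if t == N_STATES - 1 else 1.0)

w = np.zeros(N_STATES * N_ACTIONS)            # Q(s, a) = w . phi(s, a)
for n in range(20000):
    s = int(rng.integers(N_STATES))
    a = int(rng.integers(N_ACTIONS))          # stochastic action selection
    t, r = step(s, a)
    V = min(w @ phi(t, b) for b in range(N_ACTIONS))
    e = r + GAMMA * V - w @ phi(s, a)         # one-step TD error
    w += 0.1 * e * phi(s, a)                  # gradient step on the linear "Q-net"

greedy = [min(range(N_ACTIONS), key=lambda a: w @ phi(s, a)) for s in range(N_STATES)]
print("greedy policy:", greedy)
```

A tabular Q-table is the special case of this setup with one weight per (state, action) pair, which is exactly what the one-hot features give here; a real Q-net would replace `phi` and the single gradient step with a network and full backpropagation.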
12/11/2017 — NEURODYNAMIC PROGRAMMING
BIBLIOGRAPHY PART I
Simon Haykin, 1999, Neural Networks: A Comprehensive Foundation.
Bertsekas and Tsitsiklis, 1996, Neuro-Dynamic Programming.
Hillier and Lieberman, 1997, Introducción a la investigación de operaciones.
Patiño, Fullana and Schugurensky, 2004, Programación Dinámica.