PRESENTATION: Neurodynamic Programming

NEURODYNAMIC PROGRAMMING
CHAPTER 12
Simon Haykin
12/11/2017 NEURODYNAMIC PROGRAMMING 1

TOPIC ORGANIZATION

OBJECTIVE
Learning
  With a teacher (supervised)
  Without a teacher
    Reinforcement
      Modern: planning
      Classical: punishment and reward
    Self-organized
Dynamic programming: theoretical foundation
Neural networks: learning capacity
Decisions must balance the desire for a low present cost against the risk of a high future cost.

BELLMAN'S PRINCIPLE OF OPTIMALITY
Principle of optimality: An optimal policy has the property that, whatever the initial
state and initial decision are, the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first decision.
DYNAMIC PROGRAMMING ALGORITHM
For any initial state $X_0$, the optimal cost $J^*(X_0)$ of the finite-horizon problem
is equal to $J_0(X_0)$, where the function $J_0$ is obtained from the last step of the algorithm

$$J_n(X_n) = \min_{\mu_n} E_{X_{n+1}} \left[ g_n(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1}) \right]$$

which runs backward in time ($n = K-1, \ldots, 0$), starting from the terminal cost

$$J_K(X_K) = g_K(X_K) \quad (K = \text{horizon})$$

If $\mu_n^*$ minimizes the expectation on the right-hand side for each $X_n$ and $n$, then the
policy $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{K-1}^*\}$ is optimal.
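The backward recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' own code: it assumes a finite state space, stationary transition probabilities `P[a, i, j]` and stage costs `g[a, i, j]`, and a hypothetical two-state example (action 0 = stay, action 1 = switch) invented here for checking the arithmetic.

```python
import numpy as np

def finite_horizon_dp(P, g, g_terminal, K):
    """Backward dynamic programming over a horizon of K stages.

    P[a, i, j]    : probability of moving from state i to j under action a (assumed stationary)
    g[a, i, j]    : cost of that transition
    g_terminal[i] : terminal cost g_K
    Returns J with shape (K + 1, N) and the minimizing actions mu with shape (K, N).
    """
    n_actions, n_states, _ = P.shape
    J = np.zeros((K + 1, n_states))
    mu = np.zeros((K, n_states), dtype=int)
    J[K] = g_terminal
    for n in range(K - 1, -1, -1):  # runs backward in time
        # expected[a, i] = sum_j P[a, i, j] * (g[a, i, j] + J_{n+1}(j))
        expected = np.einsum("aij,aij->ai", P, g + J[n + 1][None, None, :])
        J[n] = expected.min(axis=0)
        mu[n] = expected.argmin(axis=0)
    return J, mu

# Hypothetical example: staying costs 2 in state 0 and 1 in state 1; switching costs 0.5.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
g = np.zeros((2, 2, 2))
g[0, 0, 0], g[0, 1, 1] = 2.0, 1.0
g[1, 0, 1], g[1, 1, 0] = 0.5, 0.5
g_terminal = np.zeros(2)

J, mu = finite_horizon_dp(P, g, g_terminal, K=2)
print(J[0])  # optimal cost from each initial state
print(mu)    # minimizing action at each stage and state
```

In this toy problem switching is always cheaper than staying, so the recursion picks action 1 at every stage.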

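BELLMAN'S PRINCIPLE OF OPTIMALITY
BELLMAN'S OPTIMALITY EQUATION
The finite-horizon and infinite-horizon problems are connected by using the stationary
policy $\pi = \{\mu, \mu, \mu, \ldots\}$ for the optimization:

$$J^*(i) = \lim_{K \to \infty} J_K(i) \quad \text{for all } i$$

$$J_{n+1}(X_0) = \min_{\mu} E_{X_1} \left[ g(X_0, \mu(X_0), X_1) + \gamma J_n(X_1) \right], \qquad J_0(X) = 0 \ \text{for all } X$$

If $n + 1 = K$ and $X_0 = i$, then:

$$J^*(i) = \min_{\mu} \left( c(i, \mu(i)) + \gamma \sum_{j=1}^{N} p_{ij}(\mu) J^*(j) \right)$$

where $\gamma$ is the discount factor, $p_{ij}(\mu)$ is the transition probability from state $i$
to state $j$, and $c(i, \mu(i))$ is the expected immediate cost:

$$c(i, \mu(i)) = E_{X_1} \left[ g(i, \mu(i), X_1) \right] = \sum_{j=1}^{N} p_{ij} \, g(i, \mu(i), j), \qquad E_{X_1} \left[ J^*(X_1) \right] = \sum_{j=1}^{N} p_{ij} J^*(j)$$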
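As a rough numerical check of the optimality equation, the sketch below computes the expected immediate cost $c(i, a)$ from transition costs and verifies that a candidate $J^*$ is a fixed point of the Bellman backup. The two-state problem, the discount factor 0.5, and the function names are assumptions made for this illustration only.

```python
import numpy as np

def expected_immediate_cost(P, g):
    """c[a, i] = sum_j p_ij(a) * g(i, a, j)."""
    return np.einsum("aij,aij->ai", P, g)

def bellman_backup(J, P, c, gamma):
    """(TJ)(i) = min_a [ c(i, a) + gamma * sum_j p_ij(a) * J(j) ]."""
    return (c + gamma * (P @ J)).min(axis=0)

# Hypothetical problem: action 0 = stay, action 1 = switch state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
g = np.zeros((2, 2, 2))
g[0, 0, 0], g[0, 1, 1] = 2.0, 1.0   # staying costs 2 in state 0, 1 in state 1
g[1, 0, 1], g[1, 1, 0] = 0.5, 0.5   # switching costs 0.5
gamma = 0.5

c = expected_immediate_cost(P, g)

# Always switching costs 0.5 per step, so its discounted cost is 0.5 / (1 - 0.5) = 1.
J_candidate = np.array([1.0, 1.0])
residual = bellman_backup(J_candidate, P, c, gamma) - J_candidate
print(c)
print(residual)  # zero residual means J_candidate satisfies the optimality equation
```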

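VALUE ITERATION
VALUE ITERATION ALGORITHM
1. Start with an arbitrary initial value $J_0(i)$ for each state $i = 1, 2, \ldots, N$.
2. For $n = 0, 1, 2, \ldots$, compute
   $$J_{n+1}(i) = \min_{a} \left\{ c(i, a) + \gamma \sum_{j=1}^{N} p_{ij}(a) J_n(j) \right\}$$
   with discount factor $\gamma$, until $|J_{n+1}(i) - J_n(i)| < \epsilon$ for every state $i$,
   where $\epsilon$ is the tolerance parameter. At that point $J_n(i) \approx J^*(i)$.
3. Compute the Q-factor:
   $$Q^*(i, a) = c(i, a) + \gamma \sum_{j=1}^{N} p_{ij}(a) J^*(j)$$
   Then determine the optimal policy as the greedy policy for $J^*(i)$:
   $$\mu^*(i) = \arg\min_{a} Q^*(i, a)$$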
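The three steps above map directly onto a short loop. The following is a minimal sketch under stated assumptions: costs `c[a, i]` and transitions `P[a, i, j]` are stored as NumPy arrays, the starting value is zero, and the two-state example with discount factor 0.5 is hypothetical.

```python
import numpy as np

def value_iteration(P, c, gamma, eps=1e-8, max_iter=10_000):
    """Value iteration for a cost-minimizing finite MDP.

    P[a, i, j] : transition probability, c[a, i] : expected immediate cost.
    Returns the approximate optimal cost J, the Q-factors Q[a, i], and the greedy policy.
    """
    n_actions, n_states, _ = P.shape
    J = np.zeros(n_states)                      # step 1: arbitrary J_0
    for _ in range(max_iter):                   # step 2: repeated backups
        J_next = (c + gamma * (P @ J)).min(axis=0)
        converged = np.max(np.abs(J_next - J)) < eps
        J = J_next
        if converged:
            break
    Q = c + gamma * (P @ J)                     # step 3: Q-factors
    policy = Q.argmin(axis=0)                   # greedy policy with respect to J
    return J, Q, policy

# Hypothetical example: action 0 = stay (cost 2 or 1), action 1 = switch (cost 0.5).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
c = np.array([[2.0, 1.0],
              [0.5, 0.5]])

J, Q, policy = value_iteration(P, c, gamma=0.5)
print(J, policy)
```

For this example the iteration converges to a cost of about 1 in both states, with the greedy policy choosing action 1 everywhere.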

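STAGECOACH PROBLEM
[Figure: stagecoach network with nodes A (start); B, C, D; E, F, G; H, I; J (destination), a cost on each arc, and the transitions grouped into Stage 1 to Stage 4.]
Actions (a): up, down, straight, etc.
Ten states (i): A, B, C, ..., J
Q-factor: Q(i, a)
[Figure: the same network annotated with the optimal cost-to-go of each node: A = 11; B = 11, C = 7, D = 8; E = 4, F = 7, G = 6; H = 3, I = 4.]
Three optimal routes (total cost 11):
A → C → E → H → J
A → D → E → H → J
A → D → F → I → J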
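The stagecoach problem can be solved by the same backward recursion. The sketch below is illustrative only: the arc costs are an assumption, taken from the textbook version of the problem in Hillier and Lieberman because the arc labels of the figure are not readable here. With these costs the recursion reproduces the total cost of 11 and the three routes listed above.

```python
# Arc costs (assumed, see lead-in): costs[state][next_state]
costs = {
    "A": {"B": 2, "C": 4, "D": 3},
    "B": {"E": 7, "F": 4, "G": 6},
    "C": {"E": 3, "F": 2, "G": 4},
    "D": {"E": 4, "F": 1, "G": 5},
    "E": {"H": 1, "I": 4},
    "F": {"H": 6, "I": 3},
    "G": {"H": 3, "I": 3},
    "H": {"J": 3},
    "I": {"J": 4},
}

def solve_stagecoach(costs, goal="J"):
    """Optimal cost-to-go J(i) for every state, computed backward from the goal."""
    J = {goal: 0}
    # Reversed insertion order visits later stages first, so successors are already solved.
    for state in reversed(list(costs)):
        J[state] = min(cost + J[nxt] for nxt, cost in costs[state].items())
    return J

def optimal_routes(costs, J, state="A", goal="J"):
    """All routes whose every step attains the minimum Q(i, a) = cost + J(next)."""
    if state == goal:
        return [goal]
    routes = []
    for nxt, cost in costs[state].items():
        if cost + J[nxt] == J[state]:
            routes += [state + rest for rest in optimal_routes(costs, J, nxt, goal)]
    return routes

J = solve_stagecoach(costs)
routes = optimal_routes(costs, J)
print(J["A"])   # 11
print(routes)   # ['ACEHJ', 'ADEHJ', 'ADFIJ']
```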

Q-LEARNING
Q-learning is a model-free form of reinforcement learning [Watkins, 1989; Watkins and
Dayan, 1992; Jang et al., 1997].
It is used for Markov decision problems with incomplete information. It is built on the
action-value function Q, which maps each state-action pair to an expected return.
It can be viewed as an incremental version of dynamic programming that successively
improves its evaluations of specific actions in specific states. Q-learning maintains
a value for every pair of state s and action a.
The values of the optimal policy are the targets of Q-learning, and the value of a
state is defined as the value of the best state-action pair for that state.
The policy function is the optimal policy, expressed as
$\pi^*(s) = a$ with $V^*(s) = Q^*(s, a) = \max_{b \in \text{actions}} Q^*(s, b)$.
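A small numerical illustration of the relation between Q, V, and the policy: given a table of Q-values, the state value is the row-wise maximum and the policy picks the maximizing action. The table below is hypothetical and only serves to show the indexing.

```python
import numpy as np

# Hypothetical Q-table: rows are states s, columns are actions b.
Q = np.array([[1.0, 3.0],
              [0.5, 0.2],
              [2.0, 2.5]])

V = Q.max(axis=1)          # V(s) = max_b Q(s, b)
policy = Q.argmax(axis=1)  # pi(s) = the action attaining that maximum

print(V)       # [3.  0.5 2.5]
print(policy)  # [1 0 1]
```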

Q-Learning
Q-learning proceeds with a one-step delay: the action-value function Q is updated for
the most recent state-action pair only after the outcome of the action has been observed.
At the n-th stage, the update rule for the Q-value (based on TD methods) is defined
as:

$$Q_n(s, a) = \begin{cases} Q_{n-1}(s, a) + \alpha_n \left[ r_n + \gamma V_{n-1}(t_n) - Q_{n-1}(s, a) \right] & \text{if } s = s_n \text{ and } a = a_n, \\ Q_{n-1}(s, a) & \text{otherwise,} \end{cases}$$

and $V_{n-1}(t) = \max_{b \in \text{actions}} Q_{n-1}(t, b)$,
where $s_n$ and $a_n$ are the current state and the selected action, $t_n$ is the successor
state, $r_n$ is the reinforcement received, $\alpha_n$ is the learning rate, and $\gamma$ is the
discount factor.
The Q-value $Q(s, a)$ estimates the expected return of taking action a in state s
and then using the optimal actions in all future states. By repeatedly trying
all actions in all states, the agent learns which are best overall,
judged by the long-term discounted reward.
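The update rule can be tried out with a lookup table before moving to neural approximators. The sketch below is a minimal tabular version under stated assumptions: a constant learning rate, a hypothetical four-state corridor (action 0 = left, action 1 = right, reward 1 on reaching the last state), and uniformly random action selection as the stochastic procedure.

```python
import numpy as np

def q_update(Q, s, a, r, t, alpha, gamma):
    """One-step update of the visited pair (s, a); all other entries stay unchanged."""
    v_t = Q[t].max()                                  # V(t) = max_b Q(t, b)
    Q[s, a] += alpha * (r + gamma * v_t - Q[s, a])
    return Q

def run_corridor(n_states=4, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q on a hypothetical corridor whose last state is the goal."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        for _ in range(100):                          # cap on episode length
            a = int(rng.integers(2))                  # uniform random exploration
            t = max(s - 1, 0) if a == 0 else s + 1    # deterministic move
            r = 1.0 if t == goal else 0.0
            q_update(Q, s, a, r, t, alpha, gamma)
            if t == goal:
                break
            s = t
    return Q

# Single update: target = 1 + 0.9 * 2 = 2.8, so Q[0, 1] moves halfway from 0 to 2.8.
Q_demo = np.zeros((2, 2))
Q_demo[1, 1] = 2.0
q_update(Q_demo, s=0, a=1, r=1.0, t=1, alpha=0.5, gamma=0.9)
print(Q_demo[0, 1])

Q = run_corridor()
print(Q[:-1].argmax(axis=1))  # greedy action in each non-goal state
```

Because Q-learning is off-policy, purely random exploration is enough here for the greedy policy to point toward the goal.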

Q-Learning
Flowchart of the one-step Q-learning algorithm using NN approximators:
Start
1. Observe the current state: $s_n$.
2. Select an action $a_n$ by a stochastic procedure.
3. For the selected action $a_n$, use the Q-net to compute $U_a$: $U_a \leftarrow Q_{n-1}(s_n, a_n)$.
4. Execute the action $a_n$.
5. Observe the successor state $t_n$ and the reinforcement $r_n$: $t \leftarrow t_n$.
6. Use the Q-net to compute $Q_{n-1}(t, b)$ for all $b \in \text{actions}$.
7. Compute $u \leftarrow r_n + \gamma \max_{b \in \text{actions}} Q_{n-1}(t, b)$.
8. Adjust the Q-net by backpropagating the one-step error
   $$\Delta U = \begin{cases} u - U_a & \text{if } s = s_n \text{ and } a = a_n \\ 0 & \text{otherwise} \end{cases}$$
End

Q-Learning
The preceding algorithm is applied to the following neural network architecture:
Inputs: State, Action
Output: Q-value
[Figure: Q-net architecture]
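One way to picture a Q-net of this kind is a small multilayer perceptron that takes an encoded state and action and outputs a single Q-value. The sketch below is an assumption-laden illustration, not the architecture of the slides: the one-hot encoding, the single tanh hidden layer, the layer sizes, and the learning rate are all choices made here. It performs one learning step with the one-step error $u - U_a$, taking the observed transition `(s, a, r, t)` as given.

```python
import numpy as np

class QNet:
    """Tiny MLP: input = one-hot(state) ++ one-hot(action), output = scalar Q-value."""

    def __init__(self, n_states, n_actions, n_hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n_states, self.n_actions = n_states, n_actions
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_states + n_actions))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0

    def _encode(self, s, a):
        x = np.zeros(self.n_states + self.n_actions)
        x[s] = 1.0
        x[self.n_states + a] = 1.0
        return x

    def q(self, s, a):
        h = np.tanh(self.W1 @ self._encode(s, a) + self.b1)
        return float(self.w2 @ h + self.b2)

    def backprop(self, s, a, delta, lr):
        """Gradient step on 0.5 * delta**2, with delta = u - Q(s, a) and u held fixed."""
        x = self._encode(s, a)
        h = np.tanh(self.W1 @ x + self.b1)
        grad_pre = self.w2 * (1.0 - h**2)   # dQ / d(hidden pre-activation), old w2
        self.w2 += lr * delta * h
        self.b2 += lr * delta
        self.W1 += lr * delta * np.outer(grad_pre, x)
        self.b1 += lr * delta * grad_pre

def q_learning_step(net, s, a, r, t, gamma=0.9, lr=0.05):
    """One pass: compute U_a, the one-step value u, and backpropagate the error."""
    U_a = net.q(s, a)
    u = r + gamma * max(net.q(t, b) for b in range(net.n_actions))
    net.backprop(s, a, u - U_a, lr)
    return u, U_a

net = QNet(n_states=3, n_actions=2)
u, before = q_learning_step(net, s=0, a=1, r=1.0, t=1)
after = net.q(0, 1)
print(abs(u - before), abs(u - after))  # the error for (s, a) shrinks after the step
```

Feeding the action as an input keeps a single output unit; a common alternative is one output per action, which the sketch does not cover.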
BIBLIOGRAPHY PART I
Simon Haykin, 1999, Neural Networks: A Comprehensive Foundation.
Bertsekas and Tsitsiklis, 1996, Neuro-Dynamic Programming.
Hillier and Lieberman, 1997, Introducción a la investigación de operaciones.
Patiño, Fullana and Schugurensky, 2004, Programación Dinámica.
