An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time
- Publisher: IEEE (Institute of Electrical and Electronics Engineers, Inc.)
We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which learns a critic function using learned model functions of the environment and is designed for optimal control problems in large, continuous state spaces. We extend DHP into a new algorithm, Value-Gradient Learning, VGL(λ), and prove that an instance of the new algorithm is equivalent to Backpropagation Through Time for control with a greedy policy. Not only does this equivalence provide a link between the two approaches, but it also gives our DHP variant guaranteed convergence, under certain smoothness conditions and a greedy policy, when the critic is a general smooth nonlinear function approximator. We present several experimental scenarios, including some that demonstrate divergence of DHP under a greedy policy, in contrast to our provably convergent algorithm.
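To make the idea of a value-gradient critic concrete, the following is a minimal, illustrative sketch of a DHP-style update on a toy linear system with a fixed linear policy. Everything here (the model matrices, the quadratic reward, the policy gain, the discount factor) is an assumption chosen for the demo, not taken from the paper: the critic G(s) approximates the gradient of the value function, dJ/ds, and is trained toward a target built from the model's derivatives.

```python
import numpy as np

# Illustrative sketch only (assumed toy setup, not the paper's experiments):
# a DHP-style value-gradient critic update on a linear system with a fixed
# linear policy, using known model derivatives.

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1], [0.0, 0.9]])   # model dynamics: s' = A s + B a
B = np.eye(2)
K = 0.1 * np.eye(2)                      # fixed policy: a = -K s
M = A - B @ K                            # closed-loop Jacobian ds'/ds
gamma = 0.95                             # discount factor

W = np.zeros((2, 2))                     # linear critic: G(s) = W s ~ dJ/ds
alpha = 0.05                             # learning rate

def target_gradient(s, W):
    """DHP target gradient: dr/ds + gamma * (ds'/ds)^T G(s')."""
    # Reward r(s, a) = -(s^T s + a^T a) with a = -K s, so by the chain
    # rule through the policy: dr/ds = -2 s - 2 K^T K s.
    dr_ds = -2.0 * s - 2.0 * (K.T @ K) @ s
    s_next = M @ s
    return dr_ds + gamma * M.T @ (W @ s_next)

for _ in range(3000):
    s = rng.standard_normal(2)           # sample a state
    err = W @ s - target_gradient(s, W)  # critic residual at this state
    W -= alpha * np.outer(err, s)        # gradient step on 0.5 * ||err||^2

# At convergence the critic should approximately satisfy the fixed-point
# condition W = -2 (I + K^T K) + gamma * M^T W M.
residual = W - (-2.0 * (np.eye(2) + K.T @ K) + gamma * M.T @ W @ M)
print(np.abs(residual).max())
```

In this linear setting the target is consistent at the fixed point, so the residual shrinks toward zero; with a nonlinear function approximator and a greedy policy, as the abstract notes, plain DHP can instead diverge, which is what the VGL(λ) construction and its equivalence to Backpropagation Through Time are meant to address.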