An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time

Article English OPEN
Fairbank, M. ; Alonso, E. ; Prokhorov, D. (2013)

We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which learns a critic function using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove that an instance of the new algorithm is equivalent to Backpropagation Through Time for Control with a greedy policy. This equivalence not only links the two approaches but also gives our variant of DHP guaranteed convergence, under certain smoothness conditions and a greedy policy, when a general smooth nonlinear function approximator is used for the critic. We consider several experimental scenarios, including some that show divergence of DHP under a greedy policy, in contrast to our proven-convergent algorithm.
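
To make the idea of learning value gradients with a λ parameter more concrete, here is a minimal Python sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the abstract: the scalar dynamics `x + a`, the quadratic reward, the fixed linear policy `a = -k * x` (the abstract concerns a greedy policy, which this sketch does not implement), the linear critic `w * x`, and the TD(λ)-style backward recursion used for the targets. With `lam=0` the target bootstraps entirely from the critic, in the spirit of DHP; with `lam=1` it reduces to the derivative of the trajectory return with respect to the state, which is the quantity backpropagation through time would compute in this toy setting.

```python
# Toy sketch (all modelling choices are hypothetical, not from the paper):
# dynamics x' = x + a, reward r = -(x^2 + a^2), terminal reward -(x_T^2),
# fixed linear policy a = -k*x, linear critic G(x) = w*x.

def rollout(x0, k, steps):
    """Simulate the state trajectory under the fixed policy a = -k*x."""
    xs = [x0]
    for _ in range(steps):
        a = -k * xs[-1]
        xs.append(xs[-1] + a)
    return xs

def total_return(x0, k, steps, gamma):
    """Discounted return of the trajectory starting at x0."""
    xs = rollout(x0, k, steps)
    ret = 0.0
    for t in range(steps):
        a = -k * xs[t]
        ret += gamma ** t * -(xs[t] ** 2 + a ** 2)
    ret += gamma ** steps * -(xs[steps] ** 2)
    return ret

def value_gradient_targets(xs, k, w, gamma, lam):
    """Backward recursion for lambda-weighted value-gradient targets.

    target[t] = Dr/Dx + gamma * Df/Dx * (lam * target[t+1]
                                         + (1 - lam) * critic(x[t+1]))
    where D/Dx is the total derivative through the policy.
    """
    T = len(xs) - 1
    targets = [0.0] * (T + 1)
    targets[T] = -2.0 * xs[T]  # gradient of the terminal reward
    for t in range(T - 1, -1, -1):
        x = xs[t]
        a = -k * x
        dr_dx = -2.0 * x + (-k) * (-2.0 * a)  # dr/dx + (da/dx) * dr/da
        df_dx = 1.0 + (-k) * 1.0              # df/dx + (da/dx) * df/da
        # At the terminal state the critic is replaced by the known gradient.
        critic_next = targets[T] if t + 1 == T else w * xs[t + 1]
        targets[t] = dr_dx + gamma * df_dx * (
            lam * targets[t + 1] + (1.0 - lam) * critic_next
        )
    return targets

xs = rollout(1.0, 0.5, 5)
targets_lam1 = value_gradient_targets(xs, 0.5, -1.3, 0.9, 1.0)
targets_lam0 = value_gradient_targets(xs, 0.5, -1.3, 0.9, 0.0)
```

In this sketch, `targets_lam1[0]` can be compared against a finite-difference estimate of `total_return` with respect to `x0`; the critic weight `w` has no influence when `lam=1`, whereas with `lam=0` every non-terminal target depends on it.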
