An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time

Fairbank, M.; Alonso, E.; Prokhorov, D.;
Open Access
  • Published: 01 Dec 2013 Journal: IEEE Transactions on Neural Networks and Learning Systems, volume 24, pages 2088-2100 (ISSN: 2162-237X, eISSN: 2162-2388)
  • Publisher: Institute of Electrical and Electronics Engineers (IEEE)
  • Country: United Kingdom
Abstract
We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function, when using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove equivalence of an instance of the new algorithm to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also enables our variant of DHP to have guaranteed convergence, under certain smoothness ...
Subjects
free text keywords: Computer Networks and Communications, Software, Artificial Intelligence, Computer Science Applications, Smoothness, Dynamic programming, Nonlinear system, Backpropagation through time, Machine learning, Backpropagation, Inductive programming, Mathematical optimization, Equivalence (measure theory), Convergence (routing), Computer science, BF, QA75, RC0321