Publication · Preprint · 2016

Q($\lambda$) with Off-Policy Corrections

Harutyunyan, Anna; Bellemare, Marc G.; Stepleton, Tom; Munos, Remi
Open Access · English
Published: 16 Feb 2016
We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions hold. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD($\lambda$). We illustrate this theoretical relationship empiric...
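To make the idea in the abstract concrete, the following is a minimal tabular sketch of a reward-based correction, not code taken from this record. It assumes the correction for the first state-action pair of a behavior-policy trajectory is the $(\gamma\lambda)^t$-weighted sum of TD errors whose bootstrap term is the expectation of the current Q-function under the target policy. The function name `q_lambda_correction`, the array layout of `Q` and `pi`, and the trajectory tuple format are all hypothetical choices made for illustration.

```python
import numpy as np

def q_lambda_correction(Q, pi, traj, gamma, lam):
    """Sum of (gamma * lam)^t-weighted expected TD errors along one trajectory.

    Q    : array [n_states, n_actions], current Q-function estimate
    pi   : array [n_states, n_actions], target policy probabilities
    traj : list of (x, a, r, x_next, done) tuples generated by the
           behavior policy (illustrative format, an assumption)
    """
    total = 0.0
    weight = 1.0
    for x, a, r, x_next, done in traj:
        # Bootstrap with the expected Q-value under the TARGET policy;
        # no importance-sampling ratio is used anywhere in this sketch.
        v_next = 0.0 if done else float(np.dot(pi[x_next], Q[x_next]))
        delta = r + gamma * v_next - Q[x, a]
        total += weight * delta
        weight *= gamma * lam
    return total

def apply_update(Q, pi, traj, gamma, lam, step_size):
    """Move Q at the trajectory's first (state, action) toward the corrected return."""
    x0, a0 = traj[0][0], traj[0][1]
    Q = Q.copy()
    Q[x0, a0] += step_size * q_lambda_correction(Q, pi, traj, gamma, lam)
    return Q
```

Because the behavior policy's probabilities never appear, the correction is only approximate off-policy, which is consistent with the abstract's statement that convergence requires conditions tying together the policy mismatch, $\lambda$, and the discount factor.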
Free text keywords: Computer Science - Artificial Intelligence, Computer Science - Learning, Statistics - Machine Learning