Preprint · 2017

A Distributional Perspective on Reinforcement Learning

Bellemare, Marc G.; Dabney, Will; Munos, Rémi
Open Access · English
Published: 21 Jul 2017
Abstract
In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to de...
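The value distribution discussed in the abstract is the law of the random return Z(x, a), whose expectation is the ordinary value function Q(x, a). As a minimal, purely illustrative sketch (not taken from this record), the snippet below shows one common way to back up such a distribution under a categorical, fixed-atom approximation: each atom is shifted by the reward, scaled by the discount, and the resulting mass is projected back onto the fixed support. The function name categorical_projection, the atom layout, and the example numbers are assumptions made only for illustration.

```python
import numpy as np

def categorical_projection(next_probs, reward, gamma, support):
    """Back up a categorical return distribution onto a fixed support.

    next_probs : probabilities over `support` for the next state's return distribution
    reward     : scalar immediate reward
    gamma      : discount factor
    support    : evenly spaced return atoms z_1 < ... < z_N
    """
    v_min, v_max = support[0], support[-1]
    delta_z = support[1] - support[0]

    # Shift and scale each atom by the Bellman backup, clipping to the support range.
    tz = np.clip(reward + gamma * support, v_min, v_max)

    # Split each atom's probability mass between its two nearest atoms on the support.
    b = (tz - v_min) / delta_z
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros_like(next_probs)
    np.add.at(projected, lower, next_probs * (upper - b))
    np.add.at(projected, upper, next_probs * (b - lower))
    # If a backed-up atom lands exactly on a support atom, lower == upper and both
    # weights above are zero, so give that atom the full mass.
    np.add.at(projected, lower, next_probs * (lower == upper))
    return projected

# Example (illustrative values): 11 atoms spanning returns in [0, 10].
support = np.linspace(0.0, 10.0, 11)
next_probs = np.full(11, 1.0 / 11)   # uniform next-state return distribution
target = categorical_projection(next_probs, reward=1.0, gamma=0.9, support=support)
print(target, target.sum())          # still sums to 1: the target remains a distribution
```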
Subjects
Keywords: Computer Science - Learning; Computer Science - Artificial Intelligence; Statistics - Machine Learning