Bellman, Richard. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.
Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-Dynamic Programming. Athena Scientific, 1996.
Hallak, Assaf, Tamar, Aviv, Munos, Rémi, and Mannor, Shie. Generalized emphatic temporal difference learning: Bias-variance analysis, 2015.
Kahn, Herman and Marshall, Andy W. Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263-278, 1953.
Kearns, Michael J. and Singh, Satinder P. Bias-variance error bounds for temporal difference updates. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pp. 142-147. Morgan Kaufmann Publishers Inc., 2000.
Mahmood, A. Rupam and Sutton, Richard S. Off-policy learning based on weighted importance sampling with linear computational complexity. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.
Mahmood, A. Rupam, Yu, Huizhen, White, Martha, and Sutton, Richard S. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.
Peng, Jing and Williams, Ronald J. Incremental multi-step Q-learning. Machine Learning, 22(1-3):283-290, 1996.
Precup, Doina, Sutton, Richard S., and Singh, Satinder. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.
Randløv, Jette and Alstrøm, Preben. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
Rummery, Gavin A. and Niranjan, Mahesan. On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department, 1994.
Singh, Satinder and Dayan, Peter. Analytical mean squared error curves for temporal difference learning. Machine Learning, 32(1):5-40, 1998.
Singh, Satinder, Jaakkola, Tommi, Littman, Michael L., and Szepesvári, Csaba. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000.
Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.