
handle: 11311/1177647
Summary: This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms. When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., at each iteration, we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy, until no improving policy can be found. We propose three safe policy-iteration schemas that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared on some chain-walk domains, the prison domain, and the Blackjack card game.
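The safe-improvement loop described above can be sketched as follows. This is a minimal illustrative example, not the paper's algorithm: it blends the current policy toward the estimated greedy one by a conservative step `alpha` and stops when no state's value improves, whereas the paper derives `alpha` from an explicit lower bound on the policy improvement.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact evaluation of a stochastic policy pi on a tabular MDP.
    P: (S, A, S) transition tensor, R: (S, A) rewards, pi: (S, A) policy."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def q_values(P, R, v, gamma):
    return R + gamma * np.einsum('sat,t->sa', P, v)

def safe_policy_iteration(P, R, gamma=0.9, alpha=0.5, tol=1e-8):
    """Conservatively move toward the greedy policy; stop when no
    improving policy is found (illustrative fixed-alpha variant)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)           # start from the uniform policy
    v = policy_evaluation(P, R, pi, gamma)
    while True:
        q = q_values(P, R, v, gamma)
        greedy = np.zeros_like(pi)
        greedy[np.arange(S), q.argmax(axis=1)] = 1.0
        pi_new = (1 - alpha) * pi + alpha * greedy   # conservative update
        v_new = policy_evaluation(P, R, pi_new, gamma)
        if np.all(v_new <= v + tol):        # no state improves: stop
            return pi, v
        pi, v = pi_new, v_new
```

On exact (non-approximate) evaluations, as here, each blended update keeps the value from oscillating on small chain-like MDPs; the paper's contribution is choosing the step size from a provable improvement bound so the same guarantee holds under approximation.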
Keywords: Approximate Dynamic Programming, Reinforcement Learning, Approximate Policy Iteration, Policy Oscillation, Policy Chattering, Markov Decision Process, Markov and semi-Markov decision processes, Learning and adaptive systems in artificial intelligence, Dynamic programming
Citation indicators (all based on the underlying citation network):
selected citations: 0 (derived from selected sources; an alternative to the "influence" indicator, which reflects the overall/total impact of the article, diachronically)
popularity: Average (the "current" impact/attention, the "hype", of the article in the research community at large)
influence: Average (the overall/total impact of the article in the research community at large, diachronically)
impulse: Average (the initial momentum of the article directly after its publication)
