
handle: 11311/1177647
Summary: This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms. When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., at each iteration, we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy, until no improving policy can be found. We propose three safe policy-iteration schemas that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared on some chain-walk domains, the prison domain, and the Blackjack card game.
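The safe-improvement loop described above can be sketched as follows. This is a minimal illustrative example, not the paper's algorithm: it blends the current policy toward the estimated greedy one by a conservative step `alpha` and stops when no state's value improves, whereas the paper derives `alpha` from an explicit lower bound on the policy improvement.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact evaluation of a stochastic policy pi on a tabular MDP.
    P: (S, A, S) transition tensor, R: (S, A) rewards, pi: (S, A) policy."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def q_values(P, R, v, gamma):
    return R + gamma * np.einsum('sat,t->sa', P, v)

def safe_policy_iteration(P, R, gamma=0.9, alpha=0.5, tol=1e-8):
    """Conservatively move toward the greedy policy; stop when no
    improving policy is found (illustrative fixed-alpha variant)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)           # start from the uniform policy
    v = policy_evaluation(P, R, pi, gamma)
    while True:
        q = q_values(P, R, v, gamma)
        greedy = np.zeros_like(pi)
        greedy[np.arange(S), q.argmax(axis=1)] = 1.0
        pi_new = (1 - alpha) * pi + alpha * greedy   # conservative update
        v_new = policy_evaluation(P, R, pi_new, gamma)
        if np.all(v_new <= v + tol):        # no state improves: stop
            return pi, v
        pi, v = pi_new, v_new
```

On exact (non-approximate) evaluations, as here, each blended update keeps the value from oscillating on small chain-like MDPs; the paper's contribution is choosing the step size from a provable improvement bound so the same guarantee holds under approximation.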
Keywords: Approximate Dynamic Programming, Reinforcement Learning, Approximate Policy Iteration, Policy Oscillation, Policy Chattering, Markov Decision Process, Markov and semi-Markov decision processes, Learning and adaptive systems in artificial intelligence, Dynamic programming
Citation indicators (all based on the underlying citation network):
selected citations: 0 (derived from selected sources; an alternative to the "influence" indicator, which reflects the overall/total impact of the article, diachronically)
popularity: Average (the "current" impact/attention, the "hype", of the article in the research community at large)
influence: Average (the overall/total impact of the article in the research community at large, diachronically)
impulse: Average (the initial momentum of the article directly after its publication)
