
arXiv: 2401.15196
Regularized Markov Decision Processes serve as models of sequential decision making under uncertainty wherein the decision maker has limited information processing capacity and/or aversion to model ambiguity. With functional approximation, the convergence properties of learning algorithms for regularized MDPs (e.g. soft Q-learning) are not well understood because the composition of the regularized Bellman operator and a projection onto the span of basis vectors is not a contraction with respect to any norm. In this paper, we consider a bi-level optimization formulation of regularized Q-learning with linear functional approximation. The {\em lower} level optimization problem aims to identify a value function approximation that satisfies Bellman's recursive optimality condition and the {\em upper} level aims to find the projection onto the span of basis vectors. This formulation motivates a single-loop algorithm with finite time convergence guarantees. The algorithm operates on two time-scales: updates to the projection of state-action values are `slow' in that they are implemented with a step size that is smaller than the one used for `faster' updates of approximate solutions to Bellman's recursive optimality equation. We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise. In addition, we provide a performance guarantee for the policies derived from the proposed algorithm.
FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence
FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
