
arXiv: 2207.00713
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term "(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a "q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
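The abstract's reference to the Gibbs measure generated from the q-function can be illustrated with a minimal sketch. It assumes the usual Gibbs form, with density proportional to exp(q / temperature), evaluated on a discretized action grid at one fixed time and state. The quadratic q-function, the grid, and the names `gibbs_policy` and `temperature` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def gibbs_policy(q_values, temperature):
    """Return discrete Gibbs weights proportional to exp(q / temperature)."""
    q_values = np.asarray(q_values, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    logits = (q_values - q_values.max()) / temperature
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical q-function at one fixed (t, x): a concave quadratic in the action.
actions = np.linspace(-2.0, 2.0, 401)
q_values = -(actions - 0.5) ** 2

# Assumed temperature (entropy-regularization weight); smaller means less exploration.
probs = gibbs_policy(q_values, temperature=0.1)

# Draw one exploratory action from the resulting stochastic policy.
rng = np.random.default_rng(0)
sampled_action = rng.choice(actions, p=probs)
```

With a quadratic q-function as assumed here, the normalized weights approximate a Gaussian density centered at the maximizer of q, which is one case where the Gibbs density is available explicitly.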
70 pages, 4 figures, appended with an erratum
q-function, FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Artificial Intelligence, martingale, Learning and adaptive systems in artificial intelligence, Computational Finance (q-fin.CP), policy improvement, on-policy and off-policy, Machine Learning (cs.LG), FOS: Economics and business, Quantitative Finance - Computational Finance, Artificial Intelligence (cs.AI), continuous-time reinforcement learning
| selected citations | 4 |
| popularity | Top 10% |
| influence | Average |
| impulse | Top 10% |
