Online Learning for Stochastic Shortest Path Model via Posterior Sampling

Publication: Article, Preprint | 01 Jan 2021 | Embargo end date: 01 Jan 2021 | Publisher: arXiv | Journal: CoRR, volume abs/2106.05335

Authors: Mehdi Jafarnia-Jahromi; Liyu Chen; Rahul Jain; Haipeng Luo

doi: 10.48550/arxiv.2106.05335

arXiv: 2106.05335


Abstract

We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch ends when either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any state-action pair has doubled. We establish a Bayesian regret bound of $O(B_\star S\sqrt{AK})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm requires only knowledge of the prior distribution and has no hyper-parameters to tune. It is the first such posterior sampling algorithm and numerically outperforms previously proposed optimism-based algorithms.
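The epoch structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the Dirichlet(1) prior over transitions, the known unit costs, the fixed initial state 0, the fixed-iteration value iteration, and the reading of "doubled" as `n_now > 2 * max(1, n_at_start)` are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def sample_model(counts, rng):
    """Draw transitions from a Dirichlet posterior (assumed prior: Dirichlet(1))."""
    S, A, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(1.0 + counts[s, a])
    return P

def greedy_policy(P, cost, iters=500):
    """Value iteration for SSP; the last next-state index is the goal (value 0)."""
    S = P.shape[0]
    V = np.zeros(S + 1)
    for _ in range(iters):
        Q = cost + P @ V          # (S, A, S+1) @ (S+1,) -> (S, A)
        V[:S] = Q.min(axis=1)
    return Q.argmin(axis=1)

def epoch_should_end(goal_now, goal_prev, n_now, n_at_start):
    """Stopping rule from the abstract; the max(1, .) guard is an assumption."""
    more_goal_visits = goal_now > goal_prev
    doubled = bool(np.any(n_now > 2 * np.maximum(1, n_at_start)))
    return more_goal_visits or doubled

def psrl_ssp(P_true, cost, K, seed=0):
    """Run K episodes; return (total cost incurred, number of epochs)."""
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    goal = S
    counts = np.zeros((S, A, S + 1))
    episodes, total_cost, epochs, goal_prev, s = 0, 0.0, 0, 0, 0
    while episodes < K:
        epochs += 1
        # Sample a model from the posterior and follow its optimal policy.
        policy = greedy_policy(sample_model(counts, rng), cost)
        n_at_start = counts.sum(axis=2)
        goal_now = 0
        while episodes < K:
            a = policy[s]
            s_next = rng.choice(S + 1, p=P_true[s, a])
            counts[s, a, s_next] += 1
            total_cost += cost[s, a]
            if s_next == goal:
                goal_now += 1
                episodes += 1
                s = 0             # assumed fixed initial state
            else:
                s = s_next
            if epoch_should_end(goal_now, goal_prev, counts.sum(axis=2), n_at_start):
                break
        goal_prev = goal_now
    return total_cost, epochs

def make_toy_ssp(S=3, A=2, seed=1):
    """Hypothetical toy SSP: every action reaches the goal with probability >= 0.2."""
    rng = np.random.default_rng(seed)
    P = 0.8 * rng.dirichlet(np.ones(S + 1), size=(S, A))
    P[:, :, S] += 0.2
    return P, np.ones((S, A))

P_true, cost = make_toy_ssp()
total_cost, epochs = psrl_ssp(P_true, cost, K=20)
print(f"total cost over 20 episodes: {total_cost:.0f}, epochs: {epochs}")
```

In this sketch the only inputs are the prior (through `sample_model`) and the environment, which mirrors the abstract's point that there are no hyper-parameters to tune; the iteration count in `greedy_policy` is a numerical convenience of the sketch, not part of the algorithm.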

Related Organizations

University of California System, United States
University of Southern California, United States

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)

Impact by BIP!

	selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
	influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
