Sublinear regret for learning POMDPs

Type: Article, Preprint

Date: 01 Sep 2022

Embargo end date: 01 Jan 2021

Language: English

Publisher: SAGE Publications

Journal: Production and Operations Management, volume 31, pages 3491-3504 (ISSN: 1059-1478, eISSN: 1937-5956)

Funded by: NSERC | unidentified

Authors: Yi Xiong; Ningyuan Chen; Xuefeng Gao; Xiang Zhou

doi: 10.1111/poms.13778 , 10.48550/arxiv.2107.03635

arXiv: 2107.03635


Abstract

We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
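
To make the word "sublinear" concrete, here is a short sketch under assumed notation. The symbols $\rho^*$, $r_t$ and $C$ do not appear in the abstract and are introduced only for illustration: $\rho^*$ stands for the oracle's optimal long-run average reward, $r_t$ for the reward the learner collects in period $t$, and $C$ for a constant independent of $T$. If regret is measured as the cumulative shortfall against the oracle, a bound of the stated order implies that the average per-period regret vanishes as the horizon grows:

$$
\mathrm{Regret}(T) \;=\; T\rho^* - \sum_{t=1}^{T} r_t \;\le\; C\,T^{2/3}\sqrt{\log T}
\quad\Longrightarrow\quad
\frac{\mathrm{Regret}(T)}{T} \;\le\; C\,\frac{\sqrt{\log T}}{T^{1/3}} \;\longrightarrow\; 0
\quad\text{as } T \to \infty .
$$

The exact regret definition and the mode in which the bound holds (for example, in expectation or with high probability) should be checked against the paper itself.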

Related Organizations

University of Toronto
Canada
Chinese University of Hong Kong
China (People's Republic of)

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)

Impact by BIP!

- selected citations: 3. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.