
arXiv: 2002.09795
The use of target networks is a common practice in deep reinforcement learning for stabilizing the training; however, theoretical understanding of this technique is still limited. In this paper, we study the so-called periodic Q-learning algorithm (PQ-learning for short), which resembles the technique used in deep Q-learning for solving infinite-horizon discounted Markov decision processes (DMDP) in the tabular setting. PQ-learning maintains two separate Q-value estimates - the online estimate and target estimate. The online estimate follows the standard Q-learning update, while the target estimate is updated periodically. In contrast to the standard Q-learning, PQ-learning enjoys a simple finite time analysis and achieves better sample complexity for finding an epsilon-optimal policy. Our result provides a preliminary justification of the effectiveness of utilizing target estimates or networks in Q-learning algorithms.
FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
