An approximate policy iteration viewpoint of actor–critic algorithms

Type: Article, Preprint; Date: 01 Sep 2025; Embargo end date: 01 Jan 2022; Language: English

Publisher: Elsevier BV

Journal: Automatica, volume 179, page 112395 (ISSN: 0005-1098)

Authors: Zaiwei Chen; Siva Theja Maguluri;

doi: 10.1016/j.automatica.2025.112395 , 10.48550/arxiv.2208.03247

arXiv: 2208.03247

Abstract

In this work, we consider policy-based methods for solving the reinforcement learning problem and establish sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. For the actor, we consider various policy update rules, including the celebrated natural policy gradient. In contrast to the gradient ascent approach taken in the literature, we view natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when using increasing stepsizes. For the critic, we consider TD-learning with linear function approximation and off-policy sampling. Because TD-learning is well known to be potentially unstable in this setting, we propose a stable generic algorithm (with two specific instances: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace) that uses multi-step returns and generalized importance sampling factors, and we provide its finite-sample analysis. Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.
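
The policy iteration view of the actor can be illustrated with a small numerical sketch. It is not the authors' algorithm: it assumes a tabular softmax policy, an exact critic (no TD-learning, no function approximation, no off-policy sampling), and a hypothetical stepsize schedule $\beta_k = \gamma^{-k}$. Under these assumptions the natural policy gradient step takes the well-known multiplicative form $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(\beta_k Q^{\pi_k}(s,a))$, which approaches the greedy policy iteration update as $\beta_k$ grows. All function names below are illustrative.

```python
import numpy as np


def evaluate_q(P, R, pi, gamma):
    """Exact Q^pi of a tabular MDP. Shapes: P (S, A, S), R (S, A), pi (S, A)."""
    n_states = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)            # expected one-step reward under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return R + gamma * P @ v


def npg_update(log_pi, q, beta):
    """Tabular softmax NPG step in log space: pi_new ~ pi * exp(beta * Q)."""
    logits = log_pi + beta * q
    m = logits.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return logits - log_norm  # normalized log-probabilities


def optimal_q(P, R, gamma, iters=2000):
    """Reference solution via Q-value iteration."""
    q = np.zeros_like(R)
    for _ in range(iters):
        q = R + gamma * P @ q.max(axis=1)
    return q


def demo(seed=0, n_states=5, n_actions=3, gamma=0.9, iters=100):
    """Run NPG on a random MDP; return the optimality gap at each iteration."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)      # make each P[s, a, :] a distribution
    R = rng.random((n_states, n_actions))
    log_pi = np.full((n_states, n_actions), -np.log(n_actions))  # uniform start
    v_star = optimal_q(P, R, gamma).max(axis=1)
    gaps = []
    for k in range(iters):
        pi = np.exp(log_pi)
        q = evaluate_q(P, R, pi, gamma)
        gaps.append(float(np.max(v_star - (pi * q).sum(axis=1))))
        beta = gamma ** (-k)               # assumed increasing stepsize schedule
        log_pi = npg_update(log_pi, q, beta)
    return gaps


if __name__ == "__main__":
    gaps = demo()
    print("initial gap:", gaps[0], "final gap:", gaps[-1])
```

Calling `demo()` returns the sequence of gaps $\max_s (V^*(s) - V^{\pi_k}(s))$, which can be plotted on a log scale to inspect the convergence behavior on this toy problem.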

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, natural policy gradient, linear function approximation, finite-sample analysis, Analysis of algorithms and problem complexity, Learning and adaptive systems in artificial intelligence, off-policy sampling, complexity, Machine Learning (cs.LG)

Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
