Using offline data to speed up Reinforcement Learning in procedurally generated environments


Alain Andres; Lukas Schäfer; Stefano V. Albrecht; Javier Del Ser


Neurocomputing: Article, 2025, peer-reviewed. License: CC BY. Data sources: Crossref.

arXiv.org e-Print Archive: Preprint, 2023. Data sources: arXiv.org e-Print Archive.

Recolector de Ciencia Abierta, RECOLECTA: Article, 2025. Data sources: Recolector de Ciencia Abierta, RECOLECTA.

https://dx.doi.org/10.48550/ar...: Article, 2023. License: arXiv Non-Exclusive Distribution. Data sources: Datacite.


Publication: Article, Preprint. Published: 01 Feb 2025. Embargo end date: 01 Jan 2023. Language: English. Publisher: Elsevier BV. Journal: Neurocomputing, volume 618, page 129079 (ISSN: 0925-2312).


doi: 10.1016/j.neucom.2024.129079 , 10.48550/arxiv.2304.09825

arXiv: 2304.09825

handle: 11556/5615


Abstract

One of the key challenges of Reinforcement Learning (RL) is the ability of agents to generalise their learned policy to unseen settings. Moreover, training RL agents requires large numbers of interactions with the environment. Motivated by the recent success of Offline RL and Imitation Learning (IL), we conduct a study to investigate whether agents can leverage offline data in the form of trajectories to improve the sample-efficiency in procedurally generated environments. We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyse the impact of the quality (optimality of trajectories) and diversity (number of trajectories and covered level) of available offline trajectories on the effectiveness of both approaches. Across four well-known sparse reward tasks in the MiniGrid environment, we find that using IL for pre-training and concurrently during online RL training both consistently improve the sample-efficiency while converging to optimal policies. Furthermore, we show that pre-training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.

Initially presented at the Adaptive and Learning Agents Workshop (ALA) at the AAMAS 2023 conference; the current extended version was accepted by the journal Neurocomputing.
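The two settings named in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: it assumes a tabular softmax policy, a cross-entropy imitation loss, and plain gradient descent, and the names `il_loss_and_grad`, `pretrain`, `concurrent_step`, `il_weight`, and the demonstration data are all hypothetical. The online RL gradient is left as a caller-supplied placeholder because the abstract does not specify the RL algorithm.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one state's action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def il_loss_and_grad(theta, demos):
    """Mean cross-entropy imitation loss and its gradient w.r.t. the logits.

    theta: list of per-state logit lists (assumed tabular policy).
    demos: list of (state, action) pairs taken from offline trajectories.
    """
    grad = [[0.0] * len(row) for row in theta]
    loss = 0.0
    for s, a in demos:
        probs = softmax(theta[s])
        loss -= math.log(probs[a]) / len(demos)
        for k, p in enumerate(probs):
            grad[s][k] += (p - (1.0 if k == a else 0.0)) / len(demos)
    return loss, grad

def apply_update(theta, grad, lr):
    # One gradient-descent step, returning new parameters.
    return [[t - lr * g for t, g in zip(tr, gr)] for tr, gr in zip(theta, grad)]

def pretrain(theta, demos, lr=0.5, steps=200):
    """Setting (1): imitation-only updates before any online RL training."""
    for _ in range(steps):
        _, g = il_loss_and_grad(theta, demos)
        theta = apply_update(theta, g, lr)
    return theta

def concurrent_step(theta, rl_grad, demos, il_weight=1.0, lr=0.5):
    """Setting (2): one update mixing an online RL gradient with the IL gradient.

    rl_grad is a placeholder with the same shape as theta; il_weight is an
    assumed mixing coefficient, not a value from the paper.
    """
    _, g_il = il_loss_and_grad(theta, demos)
    mixed = [[r + il_weight * g for r, g in zip(rr, gr)]
             for rr, gr in zip(rl_grad, g_il)]
    return apply_update(theta, mixed, lr)

# Toy offline data: 3 states, 3 actions, four demonstrated (state, action) pairs.
demos = [(0, 1), (1, 0), (2, 2), (0, 1)]
theta0 = [[0.0] * 3 for _ in range(3)]

loss_before, _ = il_loss_and_grad(theta0, demos)
theta_pre = pretrain(theta0, demos)
loss_after, _ = il_loss_and_grad(theta_pre, demos)
greedy = [row.index(max(row)) for row in theta_pre]  # greedy action per state

zero_rl_grad = [[0.0] * 3 for _ in range(3)]
theta_mix = concurrent_step(theta0, zero_rl_grad, demos)
```

In this sketch, pre-training simply moves the policy towards the demonstrated actions before online training begins, while the concurrent variant adds the imitation gradient to every online update. How the two terms are weighted and scheduled in practice is a design choice that the abstract does not describe.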

Related Organizations

University of Edinburgh (United Kingdom)
TECNALIA (Spain)
University of the Basque Country (Spain)
FUNDACION TECNALIA RESEARCH & INNOVATION (Spain)


Keywords

Imitation Learning, FOS: Computer and information sciences, Diversity, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Artificial Intelligence, Computer Science - Artificial Intelligence, Cognitive Neuroscience, Generalization, Procedurally generated environments, Reinforcement Learning, Computer Science Applications, Machine Learning (cs.LG)

Impact by BIP!

- Selected citations: 4
- Popularity: Top 10%
- Influence: Average
- Impulse: Average


Open Access routes: Green, hybrid

Related to Research communities

Knowmad Institut

UArctic