iVideoGPT: Interactive VideoGPTs are Scalable World Models

Keywords: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Robotics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Robotics (cs.RO), Machine Learning (cs.LG)

Jialong Wu 0001; Shaofeng Yin; Ningya Feng; Xu He; Dong Li 0016; Jianye Hao; Mingsheng Long


arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://doi.org/10.52202/07901...

Article . 2024 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

DBLP

Conference object

Data sources: DBLP

DBLP

Article

Data sources: DBLP

Publication: Article, Preprint, Conference object
Date: 01 Jan 2024 (embargo end date: 01 Jan 2024)
Publisher: Neural Information Processing Systems Foundation, Inc. (NeurIPS)
Journal: Advances in Neural Information Processing Systems 37


doi: 10.52202/079017-2173 , 10.48550/arxiv.2405.15223

arXiv: 2405.15223


Abstract

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity makes it difficult to harness recent advances in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, allowing agents to interact with the model through next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted to serve as an interactive world model for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.

NeurIPS 2024. Code is available at the project website: https://thuml.github.io/iVideoGPT
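
The abstract describes folding visual observations, actions, and rewards into one token sequence that is modeled by next-token prediction. The sketch below illustrates only that general idea and is not the authors' implementation: the vocabulary sizes, the offset layout, the scalar action, the `quantize` helper, and the choice to discretize actions and rewards into the same vocabulary are all assumptions made for illustration, and the paper's compressive tokenizer is replaced here by pre-computed integer frame tokens.

```python
from typing import List, Sequence, Tuple

# All sizes and the vocabulary layout are illustrative assumptions,
# not values taken from the iVideoGPT paper or its code release.
VISUAL_VOCAB = 1024   # hypothetical size of the visual codebook
ACTION_BINS = 16      # hypothetical number of discretized action bins
REWARD_BINS = 8       # hypothetical number of discretized reward bins

ACTION_OFFSET = VISUAL_VOCAB
REWARD_OFFSET = ACTION_OFFSET + ACTION_BINS
VOCAB_SIZE = REWARD_OFFSET + REWARD_BINS


def quantize(value: float, low: float, high: float, bins: int) -> int:
    """Map a scalar in [low, high] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)


def flatten_trajectory(
    frames: Sequence[Sequence[int]],
    actions: Sequence[float],
    rewards: Sequence[float],
) -> List[int]:
    """Interleave per-step frame tokens, one action token, and one reward token."""
    if not (len(frames) == len(actions) == len(rewards)):
        raise ValueError("frames, actions, and rewards must have equal length")
    sequence: List[int] = []
    for frame_tokens, action, reward in zip(frames, actions, rewards):
        sequence.extend(frame_tokens)
        # Shift action and reward bins so they do not collide with visual tokens.
        sequence.append(ACTION_OFFSET + quantize(action, -1.0, 1.0, ACTION_BINS))
        sequence.append(REWARD_OFFSET + quantize(reward, 0.0, 1.0, REWARD_BINS))
    return sequence


def next_token_pairs(sequence: Sequence[int]) -> List[Tuple[int, int]]:
    """Build (current token, target token) pairs for next-token prediction."""
    return list(zip(sequence[:-1], sequence[1:]))


# Toy trajectory: two steps, two visual tokens per frame.
tokens = flatten_trajectory(
    frames=[[1, 2], [3, 4]],
    actions=[-1.0, 1.0],
    rewards=[0.0, 1.0],
)
pairs = next_token_pairs(tokens)
```

In this toy layout an autoregressive model trained on `pairs` would be asked to predict each token from its predecessors, so a new action token appended at inference time conditions the tokens generated after it. How the actual model injects actions and predicts rewards may differ from this discretized scheme.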

Related Organizations

Hebei University, China (People's Republic of)
Tsinghua University
Tianjin University, China (People's Republic of)


Related research products (9)

transformers software on GitHub (IsRelatedTo)
PerceptualSimilarity software on GitHub (IsRelatedTo)
amused software on GitHub (IsRelatedTo)
robodesk software on GitHub (IsRelatedTo)
diffusers software on GitHub (IsRelatedTo)
stylegan-v software on GitHub (IsRelatedTo)
piqa software on GitHub (IsRelatedTo)
drqv2 software on GitHub (IsRelatedTo)
vp software on GitHub (IsRelatedTo)

Impact by BIP!

Selected citations: 1
Popularity: Average
Influence: Average
Impulse: Average
