VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

Fu, Tsu-Jui; Li, Linjie; Gan, Zhe; Lin, Kevin; Wang, William Yang; Wang, Lijuan; Liu, Zicheng


Preprint . 2021

Data sources: arXiv.org e-Print Archive

https://dx.doi.org/10.48550/ar...

Article . 2021

License: CC BY

Data sources: Datacite


Publication type: Article, Preprint
Date: 01 Jan 2021
Embargo end date: 01 Jan 2021
Publisher: arXiv


doi: 10.48550/arxiv.2111.12681

arXiv: 2111.12681


Abstract

A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling. Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches. Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM. As a result, VIOLET achieves new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.

Code is available at https://github.com/tsujuifu/pytorch_violet
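As a rough illustration of the MVM objective described in the abstract, the sketch below computes a cross-entropy loss over masked patch positions only. It is not taken from the authors' repository: the function names (`sample_mask`, `mvm_loss`), the mask ratio, and the toy vocabulary size are assumptions made for illustration, and the discrete visual tokens are assumed to come from some pre-trained tokenizer that is not modeled here.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sample_mask(num_patches, mask_ratio, rng):
    """Pick the patch positions to mask (hypothetical helper, at least one)."""
    k = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), k))


def mvm_loss(logits, target_tokens, masked_positions):
    """Mean cross-entropy between predicted and original visual tokens.

    logits:           one list of vocabulary scores per patch
    target_tokens:    original discrete visual token id per patch
    masked_positions: indices of masked patches; only these are scored
    """
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    total = 0.0
    for p in masked_positions:
        total -= log_softmax(logits[p])[target_tokens[p]]
    return total / len(masked_positions)


if __name__ == "__main__":
    rng = random.Random(0)
    vocab_size, num_patches = 4, 10  # toy sizes, assumed for the demo
    targets = [rng.randrange(vocab_size) for _ in range(num_patches)]
    masked = sample_mask(num_patches, mask_ratio=0.3, rng=rng)
    # Uniform logits stand in for an untrained model's predictions,
    # so the loss equals log(vocab_size).
    uniform = [[0.0] * vocab_size for _ in range(num_patches)]
    print(masked, mvm_loss(uniform, targets, masked))
```

In an actual model the logits would come from the transformer's outputs at the masked patch positions. Unmasked patches contribute nothing to the loss in this sketch.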

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition


Impact byBIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average


Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
