
Introduction

This is the official release of the code for Dialogue-AV: a Dialogue-attended Audiovisual Dataset. Dialogue-AV is a benchmarking dataset with ~258k video clips. Each clip has two dialogue-based descriptions: a Question-Answering Dialogue (QDA) with ten question-answer pairs, and a simulated conversation between two "humans" discussing the video (an illustrative record layout is sketched at the end of this section). The dialogues are derived from human-created captions in state-of-the-art benchmarking datasets and from machine-generated captions. We use verified annotations from these datasets, focusing solely on describing the audiovisual content.

Description

In the Dialogue-AV sample shown below, the input consists of a video with an audio track, along with its original text captions (1). The output is a series of dialogue turns that describe the video's content. We process the input video with audio and video captioners (2), which generate text descriptions for each modality. All captions, including the original ones, are transformed into dialogue (4) and question-answer (5) conversations that articulate the audiovisual content.

Figure: https://github.com/lvilaca16/dialogue-av/blob/main/docs/figures/example_dialogue.png

Annotations in (4) and (5) undergo automatic validation (3) before they are accepted into Dialogue-AV. In the automatic validation step (3), accepted samples must:

- Include between 5 and 20 dialogue turns.
- Contain at least one complete sentence per dialogue turn. A complete sentence requires at least one subject, predicate, object, or noun, and one verb; it should end with appropriate punctuation and begin with a named character. Additionally, each complete sentence must contain a minimum of three words after removing punctuation (to avoid trivial sentences such as "It rains.").
- Avoid the terms "caption(s)" and "dialogue(s)", thereby eliminating references to the original prompt.

A minimal sketch of these checks appears at the end of this section. For more details about the data generation process, we refer the reader to the (to be published) manuscript.

Correspondence and Maintenance

For details about the implementation, generation, and usage, please check the official GitHub page. If you observe any issues, please contact us. All project-related issues and feature requests should be submitted through our GitHub Issues page.
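As a reading aid, the snippet below sketches how a single Dialogue-AV sample could be organized, with the QDA and the simulated conversation attached to one clip. The field names, speaker labels, and example text are purely illustrative assumptions and may not match the released files.

```python
# Hypothetical layout of a single Dialogue-AV sample.
# Field names and example text are illustrative assumptions only.
sample = {
    "clip_id": "example_clip_0001",
    # Question-Answering Dialogue (QDA): ten question-answer pairs about the clip.
    "qa_dialogue": [
        {"question": "What instrument is being played?", "answer": "An acoustic guitar."},
        # ... nine more question-answer pairs
    ],
    # Simulated conversation between two "humans" discussing the clip.
    "conversation": [
        {"speaker": "A", "text": "Someone is strumming a guitar on a porch."},
        {"speaker": "B", "text": "Yes, and you can hear birds in the background."},
        # ... further turns (5 to 20 in total after validation)
    ],
}
```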
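The sketch below is a minimal re-implementation of the automatic validation criteria listed above; it is not the project's actual validation code. It assumes spaCy with the `en_core_web_sm` model for POS tagging and sentence splitting, and it interprets "begin with a named character" as starting with an alphabetic character.

```python
# Minimal sketch of the automatic validation rules described above.
# NOTE: illustrative only; thresholds follow the criteria listed in this section.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model for POS tags and sentence splitting

# Reject any turn that mentions the original prompt ("caption(s)" or "dialogue(s)").
BANNED_TERMS = re.compile(r"\b(captions?|dialogues?)\b", re.IGNORECASE)


def is_complete_sentence(sentence: str) -> bool:
    """Check one sentence against the per-sentence criteria."""
    doc = nlp(sentence)
    has_noun = any(t.pos_ in {"NOUN", "PROPN", "PRON"} for t in doc)   # subject/object/noun
    has_verb = any(t.pos_ in {"VERB", "AUX"} for t in doc)             # predicate
    ends_ok = sentence.rstrip().endswith((".", "!", "?"))              # appropriate punctuation
    starts_ok = sentence.lstrip()[:1].isalpha()                        # "named character" (assumed interpretation)
    words = [t for t in doc if not t.is_punct]                         # words after removing punctuation
    return has_noun and has_verb and ends_ok and starts_ok and len(words) >= 3


def is_valid_dialogue(turns: list[str]) -> bool:
    """Accept a dialogue only if every listed criterion holds."""
    if not 5 <= len(turns) <= 20:          # between 5 and 20 dialogue turns
        return False
    for turn in turns:
        if BANNED_TERMS.search(turn):      # no references to the original prompt
            return False
        sentences = [s.text for s in nlp(turn).sents]
        if not any(is_complete_sentence(s) for s in sentences):
            return False                   # each turn needs at least one complete sentence
    return True
```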
Deep Learning, Audio-Video-Language Learning, Multimodal Learning
