
Discriminatively localizing sounding objects in cocktail-party scenes, i.e., scenes with mixed sound, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching the audio and visual object category distributions, where audiovisual consistency serves as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and localizing sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
To appear in NeurIPS 2020. Previous Title: Learning to Discriminatively Localize Sounding Objects in a Cocktail-party Scenario
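To make the second stage more concrete, here is a minimal, hedged sketch of the matching step described in the abstract: class-aware localization maps are obtained by comparing visual features against pre-learned object representations, and the pooled visual category distribution is aligned with the audio category distribution as a self-supervised consistency signal. This is not the authors' implementation; PyTorch, the tensor shapes, and all function names (`class_aware_maps`, `category_distribution_loss`) are illustrative assumptions.

```python
# Illustrative sketch only (not the released code): class-aware localization maps
# plus an audiovisual category-distribution matching loss. Shapes and names are
# assumptions for the sake of a runnable toy example.
import torch
import torch.nn.functional as F

def class_aware_maps(visual_feats, object_dict):
    """visual_feats: (B, C, H, W) frame feature maps from a visual backbone.
    object_dict:  (K, C) pre-learned representation (prototype) per object class.
    Returns (B, K, H, W): one localization map per candidate object class."""
    v = F.normalize(visual_feats, dim=1)   # unit-norm along the channel axis
    d = F.normalize(object_dict, dim=1)    # unit-norm class prototypes
    # cosine similarity between every spatial position and every class prototype
    return torch.einsum('bchw,kc->bkhw', v, d)

def category_distribution_loss(maps, audio_logits):
    """maps: (B, K, H, W) class-aware localization maps.
    audio_logits: (B, K) class scores predicted from the mixed audio.
    Audiovisual consistency: the visual category distribution pooled from the
    maps should match the category distribution inferred from the audio."""
    visual_logits = maps.amax(dim=(2, 3))          # strongest response per class
    p_audio = F.softmax(audio_logits, dim=1)
    log_p_visual = F.log_softmax(visual_logits, dim=1)
    return F.kl_div(log_p_visual, p_audio, reduction='batchmean')

if __name__ == "__main__":
    # Toy usage with random tensors
    B, C, H, W, K = 2, 512, 14, 14, 10
    feats = torch.randn(B, C, H, W)
    prototypes = torch.randn(K, C)      # stands in for the learned object dictionary
    audio_logits = torch.randn(B, K)
    maps = class_aware_maps(feats, prototypes)
    loss = category_distribution_loss(maps, audio_logits)
    print(maps.shape, loss.item())
```

In this sketch, sounding objects would be the classes whose pooled visual response and audio score are both high; silent objects are suppressed because the audio distribution assigns them little mass.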
FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Machine Learning (cs.LG), Multimedia (cs.MM), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
| Indicator | Description | Value |
| --- | --- | --- |
| citations | Overall/total impact of the article in the research community, based on the underlying citation network (diachronically); an alternative to the "Influence" indicator. | 7 |
| popularity | "Current" impact/attention (the "hype") of the article in the research community, based on the underlying citation network. | Top 10% |
| influence | Overall/total impact of the article in the research community, based on the underlying citation network (diachronically). | Average |
| impulse | Initial momentum of the article directly after its publication, based on the underlying citation network. | Top 10% |
| views | | 85 |
| downloads | | 25 |