
We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. In this paper, S2H integrates both tasks in an end-to-end trainable unified structure using transformer-based architectures. A naive combination of them, however, results in suboptimal performance. We propose a dynamic filtering mechanism that selects relevant object queries from the object detector to resolve this issue. We conduct extensive experiments to verify that our approach achieves the state-of-the-art performance in audio source separation on the MUSIC and MUSIC-21 datasets, while maintaining competitive object detection performance. Ablation studies confirm that the joint training of detection and separation is mutually beneficial for both tasks.
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
