Impact of Visual Modality on Robustness of Self-Supervised Speech Representations Against Adversarial Attacks

SOVEREIGN Research Kernel

Data sources: ZENODO

Publication (Report) | Under curation | Language: English | Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20671042

Abstract

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from au…

Research goal: What is the impact of incorporating visual modality into self-supervised learning for speech representations on the robustness of neural source-filter models against adversarial attacks, as evaluated using metrics like adversarial accuracy and perturbation resilience?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.8/10.
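The research goal names adversarial accuracy as an evaluation metric. As a rough illustration of what that metric measures, the sketch below computes clean and adversarial accuracy for a toy linear classifier under an FGSM-style perturbation. Everything here is an assumption made for illustration: the linear model, the choice of FGSM as the attack, the example data, and the function names (`fgsm_linear`, `adversarial_accuracy`) are hypothetical and are not taken from the report, which concerns speech encoders rather than a two-feature classifier.

```python
# Toy sketch (assumption): adversarial accuracy of a linear binary classifier
# under an FGSM-style L-infinity perturbation. Not the report's models or data.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Predict a label in {-1, +1} from the linear score w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def fgsm_linear(w, x, y, eps):
    """For a linear model with logistic loss, the sign of the input gradient
    is -y * sign(w), so the FGSM step moves x by eps in that direction."""
    return [xi - eps * y * sign(wi) for wi, xi in zip(w, x)]

def accuracy(w, b, xs, ys):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)

def adversarial_accuracy(w, b, xs, ys, eps):
    """Accuracy measured on inputs perturbed by the FGSM step above."""
    adv = [fgsm_linear(w, x, y, eps) for x, y in zip(xs, ys)]
    return accuracy(w, b, adv, ys)

# Hypothetical model and data: two well-separated and two low-margin points.
w, b = [1.0, -1.0], 0.0
xs = [[2.0, 0.0], [0.3, 0.0], [-1.0, 1.0], [0.0, 0.2]]
ys = [1, 1, -1, -1]

clean = accuracy(w, b, xs, ys)
robust = adversarial_accuracy(w, b, xs, ys, eps=0.25)
print(f"clean accuracy: {clean}, adversarial accuracy: {robust}")
```

In this toy setup the perturbation shrinks each example's margin by `eps` times the L1 norm of `w`, so only the two low-margin points are misclassified after the attack. Sweeping `eps` and recording where accuracy drops would be one simple way to approximate a perturbation-resilience curve, though the report's exact definition of that metric is not given here.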
