
doi: 10.25560/123433
This thesis explores the potential of audio Explainable Artificial Intelligence (XAI) to improve the interpretability of deep audio processing models. While most existing XAI methods focus on visual and textual explanations, audio explanations offer a more intuitive approach for audio-based tasks. They provide a unique level of expressiveness, particularly where visual explanations require specialised knowledge. By aligning explanations with the audio domain, this research aims to bridge the interpretability gap and enhance understanding of complex audio models. As a case study, this thesis examines COVID-19 detection from cough and speech audio, considering both classifier performance and explainability. To enhance interpretability, CoughLIME, a modified version of LIME tailored for cough data, is introduced. CoughLIME generates faithful and listenable explanations, addressing a key challenge in trust for audio-based COVID-19 classifiers. With transformer models excelling in audio processing, the need for interpretability of their complex decision-making has grown. This thesis proposes a technique to explain audio-processing transformers by integrating their attention mechanisms with non-negative matrix factorisation (NMF). NMF decomposes audio into spectral patterns, while attention weights identify the most relevant time activations. By reconstructing key audio components, the method generates high-fidelity, listenable explanations, validated through audio classification tasks. Additionally, a novel explanation method is introduced, leveraging the meaningful representation space and generative capacity of audio foundation models. By integrating feature attribution techniques, significant features in their embedding space are identified, enabling the generation of meaningful audio explanations. Extensive evaluations explaining audio classification models confirm the effectiveness of this approach. 
Finally, we propose a novel framework that extends beyond traditional feature attribution, which emphasises only the most relevant features and overlooks the broader representational space, including less important features. Rather than simply removing features, the framework uses generative audio language models to replace them with contextually appropriate alternatives, offering a more comprehensive understanding of model behaviour.
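As a rough illustration of the NMF-plus-attention idea mentioned in the abstract, the sketch below factorises a magnitude spectrogram into spectral patterns and time activations, then keeps the components whose activations overlap most with a set of attention weights. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `nmf` and `explain`, the multiplicative-update solver, the single per-frame `attention` vector, and the `rank` and `top_k` values are all hypothetical choices made for this example, and the input is random toy data rather than real audio.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (freq x time) as W @ H.

    Uses standard multiplicative updates for the Frobenius objective.
    W holds spectral patterns (freq x rank); H holds time activations
    (rank x time).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def explain(V, attention, rank=4, top_k=2):
    """Rebuild V from the top_k components most aligned with `attention`.

    `attention` is ASSUMED to be one non-negative weight per time frame;
    how such weights are obtained from a transformer is not shown here.
    """
    W, H = nmf(V, rank)
    # Score each component by how much of its activation falls on
    # highly attended frames (hypothetical scoring rule).
    relevance = H @ attention
    keep = np.argsort(relevance)[::-1][:top_k]
    # Partial reconstruction using only the selected components.
    V_expl = W[:, keep] @ H[keep, :]
    return V_expl, keep, relevance


# Toy stand-in for a magnitude spectrogram: 64 frequency bins, 50 frames.
rng = np.random.default_rng(1)
V = np.abs(rng.normal(size=(64, 50)))

# Toy attention: uniform weight on frames 10..19, zero elsewhere.
attention = np.zeros(50)
attention[10:20] = 0.1

V_expl, keep, relevance = explain(V, attention)
```

Turning `V_expl` into a listenable signal would additionally require phase information and an inverse time-frequency transform, which this sketch leaves out.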
