MACE: Leveraging Audio for Evaluating Audio Captioning Systems

Name: MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Keywords: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Dixit, Satvik; Deshmukh, Soham; Raj, Bhiksha


arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

https://doi.org/10.1109/icassp...

Article . 2025 . Peer-reviewed

License: STM Policy #29

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: CC BY

Data sources: Datacite

Publication: Article, Preprint (06 Apr 2025)
Embargo end date: 01 Jan 2024
Publisher: IEEE
Journal: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

doi: 10.1109/icasspw65056.2025.11011270, 10.48550/arxiv.2411.00321

arXiv: 2411.00321

Abstract

The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics such as ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references and overlook the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE combines information from the audio signal with the predicted and reference captions and weights the result with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves relative accuracy improvements of 3.28% and 4.36% over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively. Moreover, it significantly outperforms all previous metrics on the audio captioning evaluation task. The metric is open-sourced at https://github.com/satvik-dixit/mace
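
The idea in the abstract, scoring a candidate caption against both the audio and the reference captions and then applying a fluency penalty, can be illustrated with a small sketch. This is not the authors' implementation: the function name `mace_like_score`, the equal weighting (`audio_weight=0.5`), the use of the best-matching reference, the penalty value (`penalty=0.9`), and the toy two-dimensional embeddings are all assumptions made for illustration. A real system would obtain embeddings from an audio-text model and the fluency flag from an error detector; see the linked repository for the actual metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mace_like_score(cand_emb, audio_emb, ref_embs, has_fluency_error,
                    audio_weight=0.5, penalty=0.9):
    """Hypothetical audio+text caption score with a fluency penalty.

    audio_weight and penalty are illustrative assumptions,
    not values taken from the paper.
    """
    # Similarity between the candidate caption and the audio clip.
    audio_sim = cosine(cand_emb, audio_emb)
    # Similarity between the candidate and its best-matching reference.
    text_sim = max(cosine(cand_emb, ref) for ref in ref_embs)
    # Weighted combination of the two views.
    combined = audio_weight * audio_sim + (1 - audio_weight) * text_sim
    # Scale the score down when the caption is flagged as disfluent.
    return combined * (1 - penalty) if has_fluency_error else combined


# Toy embeddings (assumed values, for illustration only).
cand = [1.0, 0.0]
audio = [1.0, 0.0]
refs = [[0.0, 1.0], [1.0, 1.0]]

score = mace_like_score(cand, audio, refs, has_fluency_error=False)
penalized = mace_like_score(cand, audio, refs, has_fluency_error=True)
print(f"fluent: {score:.4f}, disfluent: {penalized:.4f}")
```

In this sketch a caption that matches the references but not the audio, or the reverse, receives only partial credit, which is the gap in text-only metrics that the abstract points out.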

Related Organizations

Carnegie Mellon University
United States

Related research

mace software on GitHub (relation: IsRelatedTo)

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
