Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

Name: Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
Keywords: Human-Computer Interaction, FOS: Computer and information sciences, Computation and Language, Computation and Language (cs.CL), Human-Computer Interaction (cs.HC)

Haq, Sami Ul; Castilho, Sheila; Graham, Yvette

arXiv.org e-Print Archive

Preprint . 2025

Data sources: arXiv.org e-Print Archive

https://doi.org/10.18653/v1/20...

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2025

License: CC BY

Data sources: Datacite

Publication: Article, Preprint, 01 Jan 2025
Embargo end date: 01 Jan 2025
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the Tenth Conference on Machine Translation

doi: 10.18653/v1/2025.wmt-1.3 , 10.48550/arxiv.2509.14023

arXiv: 2509.14023

Abstract

Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translations being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

Accepted at WMT 2025 (EMNLP) for oral presentation
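
The abstract describes comparing system rankings across the two modalities and testing whether differences between systems are significant. The sketch below illustrates that kind of comparison under stated assumptions: it is not the authors' procedure, the function names `kendall_tau` and `permutation_test` are hypothetical, and the choice of Kendall's tau and a permutation test on mean scores is a generic stand-in, since the abstract does not name the specific tests used. All scores in the usage note are made-up toy values, not results from the paper.

```python
import random
from itertools import combinations
from statistics import mean


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings over the same systems."""
    concordant = discordant = 0
    for s1, s2 in combinations(sorted(scores_a), 2):
        # Positive product: both modalities order the pair of systems the same way.
        sign = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def permutation_test(x, y, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference in mean score of two systems.

    An illustrative choice of test, assumed here for the sketch only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

As a usage example with toy data, `kendall_tau({"sysA": 71.2, "sysB": 68.5, "sysC": 60.1}, {"sysA": 70.0, "sysB": 69.4, "sysC": 58.3})` measures how closely a text-only ranking and an audio-based ranking agree, and `permutation_test(scores_sys_a, scores_sys_b)` estimates whether two systems' per-segment human scores differ by more than chance would explain.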

Impact (by BIP!)

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
