Data collected for the paper "Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios"

ATTANASIO, GIUSEPPE; Savoldi, Beatrice; Chechelnitsky, Daniel; Negri, Matteo; Carpuat, Marine; Sap, Maarten; Torres Martins, Andre Filipe

Dataset

Data sources: ZENODO

Research data > Dataset. Status: Under curation. Publisher: Zenodo

doi: 10.5281/zenodo.20544289

Abstract

This dataset contains the raw audio recordings, machine translation outputs, automatic quality metrics, and human survey responses collected during the Ouvia study. The study investigates how end users perceive the usability of speech-to-text machine translation across three English varieties translated into European Portuguese. Each entry pairs a sender (who records a spoken message), a receiver (who answers questions about the translation), and a validator (who assesses translation quality). See our paper for details: arXiv.

DATASET DETAILS

Dataset Description

Curated by: Giuseppe Attanasio
User Study Funded by: European Association for Machine Translation
Language(s): English, Portuguese (pt-PT)
English Variants: Native US Black, Native US White, and Hindi speakers
License: CC BY 4.0

USES

Direct Use

The dataset can be used within one of the following broad scopes:

- Study differences in end-user perceived usability of each translation outcome across demographic groups.
- As an English speech recognition or speech translation benchmark for testing model performance in translating from different English varieties to European Portuguese.

Out-of-Scope Use

We consider the following uses out of scope:

- Training of automatic speech systems or models.
- Inquiries targeted to deanonymize study participants.
- Cloning voices of the study participants.

DATASET STRUCTURE

The release contains two JSONL files: data.jsonl (study units, one row per entry) and conversation_data.jsonl (conversation and question metadata).

=== data.jsonl ===

The data fields are:

 # | Field | Description
---|-------|------------
 1 | entry_id | Unique identifier for each data entry within each language variety.
 2 | conversation_id | Identifier linking this entry to a specific conversation.
 3 | conversation_topic | Topic of the conversation (Health or Everyday).
 4 | conversation_source | Source or provenance of the conversation text.
 5 | conversation_text | The English conversation starter uttered and recorded by the sender.
 6 | sender_unique_id | Anonymized unique identifier for the sender (Prolific participant).
 7 | sender_gender | Self-reported gender of the sender (woman or man).
 8 | receiver_unique_id | Anonymized unique identifier for the message receiver (Prolific participant).
 9 | validator_unique_id | Anonymized unique identifier for the validator (Prolific participant).
10 | final_sender_unique_id | Anonymized unique identifier for the final sender in the conversation flow. It matches sender_unique_id for most entries, but in some cases (when the original sender dropped out of the study) it may correspond to a different sender who took their place.
11 | translation_text | The conversation starter automatically translated into Portuguese (pt-PT).
12 | translation_model | Speech translation model used to translate the conversation. One of DeSTA2, Phi 4, Voxtral, or Tower+.
13 | questions | List of questions shown to the receiver.
14 | receiver_responses | Responses provided by the receiver.
15 | validator_translation_score | Overall translation quality score assigned by the validator.
16 | validator_evaluations | Detailed ratings provided by the validator.
17 | validator_corrections | Number of responses assessed as incorrect by the validator.
18 | validator_question_total | Total number of questions the validator was asked to review.
19 | Unbabel--wmt23-cometkiwi-da-xl | Translation quality score from the Unbabel wmt23-cometkiwi-da-xl COMET model (continuous, unnormalized, higher is better).
20 | Unbabel--wmt22-cometkiwi-da | Translation quality score from the Unbabel wmt22-cometkiwi-da COMET model (continuous, unnormalized, higher is better).
21 | Unbabel--wmt22-comet-da | Translation quality score from the Unbabel wmt22-comet-da COMET model (continuous, unnormalized, higher is better).
22 | Unbabel--XCOMET-XL | Translation quality score from the Unbabel XCOMET-XL model (continuous, unnormalized, higher is better).
23 | google--metricx-24-hybrid-xl-v2p6 | MetricX-24 translation quality score, normalized to the [0, 1] range (higher is better). Original scores divided by 25 and inverted.
24 | baseline_satisfaction_score | Baseline survey response (1-5 Likert) for satisfaction (before seeing the validator assessment).
25 | baseline_trust_score | Baseline survey response (1-5 Likert) for trust (before seeing the validator assessment).
26 | baseline_reliance_score | Baseline survey response (1-5 Likert) for reliance (before seeing the validator assessment).
27 | satisfaction_score | Survey response (1-5 Likert) for satisfaction after seeing the validator assessment.
28 | trust_score | Survey response (1-5 Likert) for trust in the translation after seeing the validator assessment.
29 | reliance_score | Survey response (1-5 Likert) for reliance on the translation after seeing the validator assessment.
30 | language_variety | Language variety of the data (US Black, US White, or Hindi).
31 | usability_score | Average usability score: mean of satisfaction_score, trust_score, and reliance_score.
32 | baseline_usability_score | Average baseline usability score: mean of the three baseline scores.
33 | qa_score | QA score (0-1, higher is better). Computed as 1 - (validator_corrections / validator_question_total).
34 | audio | Relative path to the audio file for this entry, in the format ./wav/<variety>/entry_<entry_id>.wav.

=== conversation_data.jsonl ===

This file contains the conversation starters and associated questions. Each row is one question within a conversation. The data fields are:

 # | Field | Description
---|-------|------------
 1 | conversation_id | Identifier linking to a specific conversation. Joins with data.jsonl on this field.
 2 | conversation_topic | Topic of the conversation (Health or Everyday).
 3 | conversation_source | Source or provenance of the conversation text (e.g., MED-MT).
 4 | conversation_text | The English conversation starter text that the sender recorded.
 5 | question_id | Unique identifier for each question within a conversation.
 6 | question_text | The English question generated automatically.
 7 | translated_question | The question text machine-translated into Portuguese (pt-PT), as shown to the receiver.

DATASET CREATION

Curation Rationale

We release all data collected during our study for reproducibility and to facilitate future research on measuring the usability of speech translation systems in real-world situations.

Source Data

Data Collection and Processing

We conducted an online study through a custom web platform. See our paper linked above for details.

Who are the source data producers?

All data was collected from compensated crowdworkers recruited via Prolific (https://www.prolific.com/).

Personal and Sensitive Information

All personally identifiable information has been removed or anonymized before release. Prolific participant IDs have been replaced with random hexadecimal strings. Audio recordings are identified only by entry ID and language variety. No demographic data beyond self-reported gender (restricted to woman/man) and language variety is included.

BIAS, RISKS, AND LIMITATIONS

To address the ethical risks associated with collecting personal data, including sociodemographic attributes and recorded voice samples, we have implemented comprehensive risk management measures aligned with data protection principles and research ethics standards.

Addressing the Risk of Personal Identification

While voice recordings and demographic metadata are made publicly available for research purposes, no other information is disclosed. This choice aligns with the principles of data minimization and anonymization while maintaining utility for fairness research in AI systems.

Each participant is identified in the dataset by one or more anonymized hexadecimal strings (see sender_unique_id, receiver_unique_id, validator_unique_id, and final_sender_unique_id). These are randomly generated and bear no relation to the original Prolific participant IDs. As a result, it is not possible to uniquely identify a real individual from their anonymized identifier.

Addressing the Risk of Voice Cloning

We informed all study participants enrolling as "Sender" through the Informed Consent document that there is a risk their voice may be cloned using automatic tools. However, our study design minimizes this risk by collecting only 10 short segments per participant. Segments are short (30 seconds or less) and recorded in a neutral tone, which makes faithful and expressive voice cloning more challenging to achieve. As with personal identification, we will require any user of the dataset to explicitly agree not to use it to clone individual voices.

DATASET CARD CONTACT

Giuseppe Attanasio: gattanasio.work@gmail.com
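
EXAMPLE: RECOMPUTING DERIVED FIELDS (ILLUSTRATIVE)

The derived fields described under DATASET STRUCTURE (qa_score, usability_score) and the join between the two files on conversation_id can be sketched in a few lines of Python. This is a minimal sketch and not code shipped with the release. The helper names are hypothetical, the field types (integer counts, numeric Likert scores) are assumed, and the MetricX formula is only one reading of "divided by 25 and inverted", namely 1 - raw / 25.

```python
import json


def read_jsonl(path):
    """Read a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def qa_score(validator_corrections, validator_question_total):
    """Documented formula: 1 - (validator_corrections / validator_question_total)."""
    return 1 - validator_corrections / validator_question_total


def usability_score(satisfaction, trust, reliance):
    """Documented formula: mean of the three 1-5 Likert responses."""
    return (satisfaction + trust + reliance) / 3


def normalize_metricx(raw_score):
    """ASSUMPTION: 'divided by 25 and inverted' is read as 1 - raw / 25."""
    return 1 - raw_score / 25


def questions_by_conversation(question_rows):
    """Group conversation_data.jsonl rows (one per question) by conversation_id."""
    grouped = {}
    for row in question_rows:
        grouped.setdefault(row["conversation_id"], []).append(row)
    return grouped


def attach_questions(entries, question_rows):
    """Pair each data.jsonl entry with the question rows of its conversation."""
    grouped = questions_by_conversation(question_rows)
    return [(entry, grouped.get(entry["conversation_id"], [])) for entry in entries]
```

With the release files in the working directory, a typical call would be attach_questions(read_jsonl("data.jsonl"), read_jsonl("conversation_data.jsonl")). Comparing the recomputed qa_score and usability_score against the stored fields is a quick consistency check, though small floating-point differences are possible.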
