
General description

This resource is a high-quality Basque speech corpus compiled by HiTZ Zentroa / AhoLab. It consists of studio-quality audio recordings in WAV format and their corresponding orthographic text transcriptions.

The corpus contains read speech produced by professional native Basque speakers, recorded in a professional recording studio under controlled acoustic conditions. The material was originally designed for the development of text-to-speech (TTS) systems, with careful attention to audio quality, pronunciation clarity, and phonetic coverage.

The recordings were produced by two speakers, Maider and Antton, and cover a range of sentence types and orthographic patterns, including declarative, interrogative, and exclamative sentences, as well as specific categories such as Spanish proper names, numerical expressions, and Basque-specific spelling phenomena (e.g., “tt”).

Corpus composition

The following table summarizes the distribution of utterances by category and speaker:

| Category | Maider | Antton |
|---|---|---|
| Spanish names | 750 | 750 |
| Interrogative sentences | 2100 | 2103 |
| Exclamative sentences | 1476 | 1476 |
| Declarative sentences | 9920 | 9920 |
| “tt” spelling examples | 246 | 246 |
| Numbers | 250 | 250 |

In total, the corpus comprises:

- Maider: 13,500 utterances, approximately 17 h 33 min
- Antton: 13,500 utterances, approximately 16 h 45 min

Technical details

| Property | Value |
|---|---|
| Language | Basque (Euskara, eu) |
| Speakers | Maider, Antton |
| Speaking style | Read speech |
| Recording | Professional studio |
| Sample rate | 48,000 Hz |
| Channels | 1 (mono) |
| Encoding | PCM signed 24-bit, WAV |

Intended use

This corpus was primarily designed for text-to-speech (TTS) system development, particularly for high-quality or neural TTS models that benefit from:

- Clean, studio-recorded audio
- Consistent speaking style
- Accurate orthographic transcriptions
- Coverage of specific phonetic and orthographic phenomena in Basque

Data organization

The corpus is distributed as one compressed TAR archive per speaker, each containing the corresponding audio recordings in WAV format. Within each archive:

- Audio files are named using a unique utterance identifier, e.g. NEU_00001.wav, NEU_00002.wav, …
- All recordings correspond to read utterances produced by a single speaker.

In addition, a plain-text transcription file is provided per speaker. Each line in the transcription file associates an utterance identifier with its orthographic transcription using the following format:

    NEU_00001 text of the sentence

The utterance identifier matches the WAV filename (without extension), enabling straightforward pairing of audio files and transcriptions.

Licensing

Creative Commons Attribution 4.0 International (CC BY 4.0)

Ethical considerations

All speakers provided informed consent for the recording and distribution of their voices.

Funding

The development of this resource has been funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215335, and by a grant from the Department of Culture and Language Policy of the Basque Government (IKER-GAITU project).

Versioning

This is version 1.0 of the dataset.

Contact

aholab@aholab.ehu.eus
HiTZ Center - Aholab, University of the Basque Country UPV/EHU
https://aholab.ehu.eus/aholab/
https://www.hitz.eus/
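
Example: pairing audio and transcriptions

The pairing described under Data organization (utterance identifier equals WAV filename without extension) can be sketched in Python as follows. This is a minimal illustration, not an official loader: it assumes the transcription file is UTF-8 encoded, that the identifier and the text are separated by whitespace, and that the speaker's archive has already been extracted to a directory. The function names and the paths in the usage note are hypothetical.

```python
from pathlib import Path


def load_transcriptions(path):
    """Return a dict mapping utterance identifier -> orthographic text."""
    transcriptions = {}
    # Assumption: the transcription file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: identifier and text are separated by whitespace
            # (space or tab); everything after the first gap is the text.
            parts = line.split(maxsplit=1)
            utt_id = parts[0]
            text = parts[1] if len(parts) > 1 else ""
            transcriptions[utt_id] = text
    return transcriptions


def pair_audio_with_text(wav_dir, transcription_path):
    """Return (wav_path, text) pairs for WAV files that have a transcription."""
    transcriptions = load_transcriptions(transcription_path)
    pairs = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        utt_id = wav_path.stem  # e.g. "NEU_00001" for NEU_00001.wav
        if utt_id in transcriptions:
            pairs.append((wav_path, transcriptions[utt_id]))
    return pairs
```

With a hypothetical layout such as an extracted directory maider_wavs/ and a transcription file maider.txt, pair_audio_with_text("maider_wavs", "maider.txt") would return one (path, text) tuple per matched utterance. WAV files without a transcription line are skipped silently here; a stricter loader might report them instead.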
Basque, speech synthesis, Speech, Speech recordings
