ZENODO
Dataset . 2023
License: CC BY
Data sources: ZENODO, Datacite

Nos_ParlaSpeech-GL: Galician ASR corpus

Authors: Carmen Magariños; Adrián Vidal Miguéns; Adina Ioana Vladu; Noelia García Díaz; Marta Vázquez Abuín; Ainhoa Vivel Couso; Daniel Bardanca; +1 Authors

Abstract

This corpus is publicly accessible upon accepting the terms and conditions and requesting access.

Nos_ParlaSpeech-GL is an ASR corpus of more than 1,600 hours of automatically aligned speech and text, created from audio and official transcripts of Galician parliamentary sessions held between 2015 and 2022. The content belongs to the Galician Parliament and the data is released according to their terms of use.

The corpus is split into two subcorpora, "clean" and "other". The segments included in the "clean" subcorpus were filtered according to several alignment quality criteria, whereas the "other" subcorpus comprises the segments that were discarded in the filtering process. The details of both subcorpora are given in the table below:

Subcorpus   No. of hours   No. of segments
Clean       1,196.92       667,308
Other       477.71         130,332
Total       1,674.63       797,640

Each speech segment is also tagged with the ID of its corresponding speaker. Metadata of the different speakers, compiled within the ParlaMint-GL project, is also available.

Each audio file is named with an ID consisting of a four-letter code in capitals denoting the source of the data (Minutes of the Galician Parliament), a 3-digit session number, and an 8-digit date in DDMMYYYY format, all separated by underscores (e.g., DSPG_095_27012015.wav). For the transcription files, this ID is preceded by the name of the subcorpus to which the file belongs, "clean" or "other", followed by an underscore (e.g., clean_DSPG_095_27012015.stm, other_DSPG_095_27012015.stm).

The corpus is available in STM and JSON formats, and the audio files are released in 16 kHz 16-bit WAV format.

Hugging Face version

The corpus is also available on Hugging Face.

Disclaimer: We are not responsible for any inconsistencies in speaker identification that stem from misidentification in the original transcripts.
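The file naming scheme described in the abstract can be sketched as a small parser. This is only an illustrative sketch, not code shipped with the corpus: the names CorpusFile and parse_corpus_filename are hypothetical, and it assumes that JSON transcription files follow the same prefix convention as the STM examples, which the record does not state explicitly.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class CorpusFile(NamedTuple):
    """Fields recovered from a corpus file name (hypothetical helper type)."""
    subcorpus: Optional[str]  # "clean" or "other"; None for audio files
    source: str               # four-letter source code, e.g. "DSPG"
    session: int              # 3-digit session number
    session_date: date        # parsed from the DDMMYYYY field
    extension: str            # e.g. "wav" or "stm"


# Optional subcorpus prefix, then SOURCE_SESSION_DDMMYYYY.extension
_NAME_RE = re.compile(
    r"^(?:(?P<subcorpus>clean|other)_)?"
    r"(?P<source>[A-Z]{4})_"
    r"(?P<session>\d{3})_"
    r"(?P<day>\d{2})(?P<month>\d{2})(?P<year>\d{4})"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_corpus_filename(name: str) -> CorpusFile:
    """Split a file name such as 'clean_DSPG_095_27012015.stm' into its parts."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"file name does not follow the naming scheme: {name!r}")
    return CorpusFile(
        subcorpus=m["subcorpus"],
        source=m["source"],
        session=int(m["session"]),
        # date() raises ValueError for impossible dates such as 31022015
        session_date=date(int(m["year"]), int(m["month"]), int(m["day"])),
        extension=m["ext"],
    )
```

Parsing the date into a date object rather than keeping the raw string makes it straightforward to sort or filter sessions chronologically, since DDMMYYYY strings do not sort by date on their own.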
Funding and acknowledgements: This corpus was compiled in collaboration with VICOMTECH. "The Nós project: Galician in the society and economy of Artificial Intelligence" is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU). We would like to thank the Galician Parliament for their kind collaboration in providing the original data.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Terms and Conditions

By accessing and using this dataset, you agree to comply with all applicable laws and ethical standards regarding the protection of individual rights. The dataset contains voice files, transcripts, and metadata, including participant identity information, provided solely for research and development purposes. Users are strictly prohibited from using the dataset in any way that infringes upon the rights, privacy, or dignity of any individual represented within it. Any misuse, including but not limited to attempts to engage in discriminatory, harmful, or unlawful activities, is expressly forbidden.

Keywords

ASR, Galician, speech corpus, plenary sessions, forced alignment, parliamentary data, Galician Parliament
