KARAKALPAK SPEECH CORPUS: THE FIRST BENCHMARK DATASET FOR AUTOMATIC SPEECH RECOGNITION

Niyetbay Uteuliev; Kabul Khudaybergenov; Jabbar Kudaybergenov; Tangirbergen Kudaybergenov

ZENODO

Article, 2026

License: CC BY

Data sources: ZENODO, Datacite

Publication: Article, 18 Mar 2026, English. Publisher: Zenodo

doi: 10.5281/zenodo.19079669, 10.5281/zenodo.19079670

Abstract

While large-scale pre-trained models have significantly advanced multilingual Automatic Speech Recognition (ASR), many low-resource languages remain underserved because high-quality annotated speech corpora are scarce. This paper introduces the Karakalpak Speech Corpus (KSC), the first publicly available benchmark dataset for Karakalpak, a Turkic language spoken by over two million people, primarily in Karakalpakstan. The corpus comprises 50 hours of predominantly read speech, collected from 25 native speakers with a balanced gender distribution. To establish a performance benchmark, we fine-tuned the Wav2Vec 2.0 architecture, specifically evaluating the efficacy of transfer learning from multilingual pre-trained models.
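
The abstract does not say which metric the benchmark reports. ASR benchmarks are conventionally scored with word error rate (WER), so the following is only a minimal sketch under that assumption, not the authors' evaluation code. The function names `edit_distance` and `word_error_rate` and the example strings are hypothetical, chosen for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        hyp_words = hypothesis.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


# Placeholder transcripts (assumed for illustration, not taken from the corpus).
references = ["a b c d"]
hypotheses = ["a x c"]
wer = word_error_rate(references, hypotheses)
```

In this example the hypothesis has one substituted word and one missing word against a four-word reference, so `wer` is 0.5. Pooling errors over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.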

Related Organizations

TASHKENT KIMYO INTERNATIONAL UNIVERSITY
Uzbekistan

Keywords

speech-to-text, Machine Learning, Deep Learning, speech dataset, speech recognition, Wav2Vec 2.0 model, transfer learning
