KenSwQuAD—A Question Answering Dataset for Swahili Low-resource Language

Type: Article, Other literature type, Preprint
Date: 06 Apr 2023
Embargo end date: 01 Jan 2022
Language: English
Publisher: Association for Computing Machinery (ACM)
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing, volume 22, pages 1-20 (ISSN: 2375-4699, eISSN: 2375-4702)

Authors: Barack Wamkaya Wanjawa; Lilian Wanzare; Florence Indede; Owen McOnyango; Lawrence Muchemi; Edward Ombui

doi: 10.1145/3578553, 10.48550/arxiv.2205.02364, 10.7910/dvn/otl0lm

arXiv: 2205.02364


Abstract

The need for question-answering (QA) datasets in low-resource languages motivated this research, which led to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). The dataset is annotated from raw Swahili story texts. Swahili is a low-resource language spoken predominantly in eastern Africa, with speakers in other parts of the world. QA datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems, and machine learning systems need training data such as the gold-standard QA set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance check on a set of 12.5% of the annotated texts confirmed that all the QA pairs in that set were correctly annotated. A proof of concept that applied the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
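The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the variable names are hypothetical and not taken from the paper, and the derived values (coverage, mean pairs per text, approximate size of the quality assurance set) are computed from the abstract's numbers rather than reported by the authors.

```python
# Figures quoted in the abstract (variable names are illustrative assumptions).
total_texts = 2585          # Swahili texts available in Kencorpus
annotated_texts = 1445      # texts that received QA annotations
qa_pairs = 7526             # QA pairs in the final dataset
min_pairs_per_text = 5      # stated minimum number of QA pairs per text
qa_sample_fraction = 0.125  # share of annotated texts in the quality assurance set

# Share of the available texts that were annotated.
coverage = annotated_texts / total_texts

# Average number of QA pairs per annotated text.
mean_pairs = qa_pairs / annotated_texts

# Lower bound on the dataset size implied by the per-text minimum.
min_expected_pairs = annotated_texts * min_pairs_per_text

# Approximate number of texts in the quality assurance set
# (derived here; the abstract states only the percentage).
qa_sample_texts = annotated_texts * qa_sample_fraction

print(f"coverage of available texts: {coverage:.1%}")
print(f"mean QA pairs per annotated text: {mean_pairs:.2f}")
print(f"implied minimum QA pairs: {min_expected_pairs}")
print(f"approximate QA-check texts: {qa_sample_texts:.0f}")
```

Under these numbers, roughly 56% of the available texts were annotated, at about 5.2 QA pairs per text, and the reported total of 7,526 pairs is consistent with the stated minimum of 5 pairs per annotated text.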


Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Machine Reading Comprehension, Computer and Information Science, I.2.7, Computation and Language (cs.CL), Question Answering, Machine Learning (cs.LG)
