Shawi-Amazon: A Parallel Dataset for Spanish-Shawi Low-Resource Machine Translation

Low-resource languages, particularly those from the Amazonian region, remain largely underrepresented in current Natural Language Processing (NLP) research. In this work, we introduce the Shawi-Amazon Corpus, the first standardized parallel dataset for the Shawi (Chayahuita) language paired with Spanish. The corpus comprises approximately 9,210 aligned sentence pairs derived from the New Testament and Genesis. We detail a robust data engineering pipeline designed to address complex alignment challenges, specifically "many-to-one" verse mappings and textual variants between the Textus Receptus and Critical Text traditions. To ensure rigorous benchmarking, we implement a document-level splitting strategy, preventing data leakage between training and evaluation sets. This resource is released in standardized formats to facilitate future research in Neural Machine Translation (NMT) for the Cahuapanan language family, contributing to the digital preservation of indigenous heritage.
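The two pipeline steps named above, resolving many-to-one verse mappings and splitting at the document level, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout, the function names `align_pairs` and `split_by_book`, the choice of the biblical book as the "document" unit, and the policy of dropping a pair when a Spanish verse is missing (as can happen when the two sides follow different textual traditions) are all hypothetical. The texts are placeholders, not real corpus data.

```python
from collections import defaultdict


def align_pairs(spanish, shawi):
    """Build sentence pairs, merging Spanish verses for many-to-one mappings.

    spanish: dict mapping (book, chapter, verse) -> text
    shawi:   list of (book, chapter, first_verse, last_verse, text), where one
             Shawi segment may cover a range of verses (assumed layout).
    """
    pairs = []
    for book, chapter, first, last, text in shawi:
        keys = [(book, chapter, v) for v in range(first, last + 1)]
        # Assumed policy: skip the pair if any verse in the range is absent
        # on the Spanish side (e.g. a verse omitted in one textual tradition).
        if not all(k in spanish for k in keys):
            continue
        merged = " ".join(spanish[k] for k in keys)
        pairs.append({"book": book, "es": merged, "shawi": text})
    return pairs


def split_by_book(pairs, dev_books, test_books):
    """Assign whole books to one split so no book spans train and evaluation."""
    splits = defaultdict(list)
    for pair in pairs:
        if pair["book"] in test_books:
            splits["test"].append(pair)
        elif pair["book"] in dev_books:
            splits["dev"].append(pair)
        else:
            splits["train"].append(pair)
    return splits


# Placeholder data only; not taken from the corpus.
spanish = {
    ("GEN", 1, 1): "es-a",
    ("GEN", 1, 2): "es-b",
    ("MRK", 1, 1): "es-c",
    ("ACT", 8, 36): "es-d",  # ("ACT", 8, 37) is deliberately missing
}
shawi = [
    ("GEN", 1, 1, 2, "shawi-ab"),    # one Shawi segment covering two verses
    ("MRK", 1, 1, 1, "shawi-c"),
    ("ACT", 8, 36, 37, "shawi-de"),  # dropped: verse 37 has no Spanish text
]

pairs = align_pairs(spanish, shawi)
splits = split_by_book(pairs, dev_books=set(), test_books={"MRK"})
```

Splitting on the book rather than on individual sentence pairs means that near-duplicate or contextually linked verses from the same book cannot appear on both sides of the train/evaluation boundary, which is the leakage the abstract refers to.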

Related Organizations

National University of Engineering
Peru
