Publication: Article, Preprint
Date: 01 Jan 2023
Embargo end date: 01 Jan 2023
Publisher: Association for Computational Linguistics (ACL)
Journal: Findings of the Association for Computational Linguistics: EACL 2023

Authors: Xie, Ruoyu; Anastasopoulos, Antonios

doi: 10.18653/v1/2023.findings-eacl.111, 10.48550/arxiv.2301.09685

arXiv: http://arxiv.org/abs/2301.09685

Noisy Parallel Data Alignment


Abstract

An ongoing challenge in current natural language processing is that its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural-based alignment model by up to 59.6%.

EACL 2023 camera-ready version

Related Organizations

George Mason University
United States

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)

Impact by BIP!

	citations	0
	popularity	Average
	influence	Average
	impulse	Average



Fields of Science

natural sciences
