NTEU_Multilingual_Evaluation_Dataset

Dataset Card for NTEU Multilingual Evaluation Dataset

Dataset Description

Point of Contact: langtech@bsc.es

Dataset Summary

This evaluation dataset for machine translation was created by the NTEU (Neural Translation for the EU) project. It includes around 1,000 parallel sentences in the 24 official languages of the European Union. The original NTEU dataset has been cleaned and filtered by removing empty lines and near-duplicates, and it has been augmented with Catalan. The Catalan version was produced manually by a native Catalan translator from the original English and Spanish versions, and was sponsored by the AINA project.

Supported Tasks and Leaderboards

This dataset can be used to evaluate bilingual and multilingual machine translation systems in the legal domain, for any combination of the 24 official languages of the European Union and Catalan.

Languages

The languages included in the dataset are the following:

CODE  LANGUAGE    SCRIPT
bg    Bulgarian   Cyrillic
ca    Catalan     Latin
cs    Czech       Latin
da    Danish      Latin
de    German      Latin
el    Greek       Greek
en    English     Latin
es    Spanish     Latin
et    Estonian    Latin
fi    Finnish     Latin
fr    French      Latin
ga    Irish       Latin
hr    Croatian    Latin
hu    Hungarian   Latin
it    Italian     Latin
lt    Lithuanian  Latin
lv    Latvian     Latin
mt    Maltese     Latin
nl    Dutch       Latin
pl    Polish      Latin
pt    Portuguese  Latin
ro    Romanian    Latin
sk    Slovak      Latin
sl    Slovenian   Latin
sv    Swedish     Latin

Dataset Structure

Data Instances

A separate plain-text file is provided for each language, with sentences aligned in the same order across all files. Each file uses the two-letter code of its language as the file extension.

Data Fields

[N/A]

Data Splits

The dataset contains a single split: Test.

Dataset Creation

Curation Rationale

The aim of this dataset is to promote the evaluation of machine translation systems for the official languages of the European Union, plus Catalan.

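The file layout described under Data Instances (one file per language, aligned line by line, with the language code as the extension) can be read along the following lines. This is a minimal sketch rather than an official loader: the directory name and the file stem `test` are hypothetical, since the card does not state the actual file names, and only the language codes come from the Languages table.

```python
from itertools import permutations
from pathlib import Path

# Language codes from the Languages table: 24 official EU languages plus Catalan.
LANG_CODES = [
    "bg", "ca", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def read_sentences(path):
    """Return one sentence per line, in file order, without line terminators."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def load_pair(data_dir, stem, src, tgt):
    """Pair two aligned files named '<stem>.<code>' line by line.

    The '<stem>.<code>' naming is an assumption made for illustration.
    """
    src_sents = read_sentences(Path(data_dir) / f"{stem}.{src}")
    tgt_sents = read_sentences(Path(data_dir) / f"{stem}.{tgt}")
    if len(src_sents) != len(tgt_sents):
        # Alignment relies on identical line counts, so fail loudly on a mismatch.
        raise ValueError(
            f"{src} has {len(src_sents)} lines but {tgt} has {len(tgt_sents)}"
        )
    return list(zip(src_sents, tgt_sents))


def translation_directions(codes=LANG_CODES):
    """List every ordered (source, target) pair of distinct languages."""
    return list(permutations(codes, 2))
```

Under these assumptions, `load_pair("nteu_eval", "test", "en", "ca")` would return (English, Catalan) sentence tuples that can be passed to any translation metric, and `translation_directions()` enumerates the directions that the 25 languages make possible.
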
Source Data

Initial Data Collection and Normalization

The data was originally extracted from EUR-Lex, the official online database of European Union law and other public documents of the European Union (EU), published in the 24 official languages of the EU. The Official Journal (OJ) of the European Union is also published on EUR-Lex.

Who are the source language producers?

EUR-Lex

Annotations

Annotation process

The dataset does not contain any annotations.

Who are the annotators?

[N/A]

Personal and Sensitive Information

No specific anonymisation process has been applied, so personal and sensitive information may be present in the data. This needs to be considered when using the data for training models.

Considerations for Using the Data

Social Impact of Dataset

By providing this resource, we intend to promote the evaluation of machine translation systems covering all the official languages of the European Union and Catalan, thereby improving the accessibility and visibility of the Catalan language in Europe.

Discussion of Biases

No specific bias mitigation strategies were applied to this dataset. Inherent biases may exist within the data.

Other Known Limitations

The dataset contains data from the legal/administrative domain, so it is of limited use for applications in other domains.

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es).

Funding

This work has been promoted and financed by the Government of Catalonia through the Aina project.

Licensing Information

This work is licensed under an Attribution 4.0 International licence.

Citation Information

For more information about the NTEU Project, please refer to the following paper:

@inproceedings{bie-etal-2020-neural,
    title = "Neural Translation for the {E}uropean {U}nion ({NTEU}) Project",
    author = "Bi{\'e}, Laurent and Cerd{\`a}-i-Cuc{\'o}, Aleix and Degroote, Hans and Estela, Amando and Garc{\'i}a-Mart{\'i}nez, Mercedes and Herranz, Manuel and Kohan, Alejandro and Melero, Maite and O{'}Dowd, Tony and O{'}Gorman, Sin{\'e}ad and Pinnis, M{\={a}}rcis and Rozis, Roberts and Superbo, Riccardo and Vasi{\c{l}}evskis, Art{\={u}}rs",
    editor = "Martins, Andr{\'e} and Moniz, Helena and Fumega, Sara and Martins, Bruno and Batista, Fernando and Coheur, Luisa and Parra, Carla and Trancoso, Isabel and Turchi, Marco and Bisazza, Arianna and Moorkens, Joss and Guerberof, Ana and Nurminen, Mary and Marg, Lena and Forcada, Mikel L.",
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.60/",
    pages = "477--478",
    abstract = "The Neural Translation for the European Union (NTEU) project aims to build a neural engine farm with all European official language combinations for eTranslation, without the necessity to use a high-resourced language as a pivot. NTEU started in September 2019 and will run until August 2021."
}

Contributions

[N/A]

Related Organizations

Barcelona Supercomputing Center
Spain
