MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

Abstract: We introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and the inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. The second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case, and we specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalizing on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.

Introduction: Multiple Languages and Modalities (MLM) comprises data points on 236k human settlements for evaluating and optimizing multitask learning systems. The dataset is highly diverse in terms of modality and language. For each entity, we have extracted text summaries, images, coordinates, and the respective triple classes. Text summaries are available in three languages (English, French, and German), with each entity having between one and three language entries. Human settlements from all continents are provided in the overall dataset (MLM), with 72% located in Europe. Two further versions of the dataset - MLM-irle and MLM-irle-gr - were generated for use in the benchmark evaluation for multitask systems described in the paper (see above).
MLM-irle-gr (i.e., geo-representative) was generated to serve organizations that focus on the European Union by providing geographically balanced coverage of human settlements in this region. MLM-irle-gr contains data on 24k human settlements across the EU, weighted in relation to the population count for each of the 28 countries.

MLM contains the following fields:

----------------------------------------------------------------------
#    field-label    description
----------------------------------------------------------------------
1.   id             a unique identifier
2.   label          textual label
3.   coordinates    longitude, latitude geo-location value
4.   summaries      list of textual summaries related to the entity
5.   images         list of images related to the entity
6.   classes        list of associated triple classes
----------------------------------------------------------------------

MLM - Details by Dataset Version:

-----------------------------------------------------------
Num. of           MLM       MLM-irle    MLM-irle-gr
-----------------------------------------------------------
Entities          236496    218681      22501
Images            412422    314533      31621
Summaries         497899    462328      47508
Triple classes    1685      1655        452
-----------------------------------------------------------

Availability: All three versions of MLM listed in the table directly above are available for direct download and use. To support findability and sustainability, the MLM dataset is published as an online resource at https://doi.org/10.5281/zenodo.3885753. A separate page with detailed explanations and illustrations is available at http://cleopatra.ijs.si/goal-mlm/ to promote ease of use. The project GitHub repository, which contains the complete source code for the system and the generation script, is available at https://github.com/GOALCLEOPATRA/MLM. Documentation adheres to the FAIR Data principles, with all relevant metadata specified for the research community and users.
The dataset is freely accessible under the Creative Commons Attribution 4.0 International license, which makes it reusable for almost any purpose.

Updating and Reusability: MLM is supported by a team of researchers from the University of Bonn, the Leibniz Information Center for Science and Technology, and the Jožef Stefan Institute. The resource is already in use for individual projects and as a contribution to the project deliverables of the Marie Skłodowska-Curie CLEOPATRA Innovative Training Network. In addition to the steps above that make the resource available to the wider community, the usage of MLM will be promoted to the network of researchers in this project. Use among researchers and practitioners in the digital humanities will be promoted through demonstrations and presentations at domain-related events. Activities are planned for the Digital Methods Summer School run by the University of Amsterdam. The range of modalities and languages present in the dataset also extends its application to research on multimodal representation learning, multilingual machine learning, information retrieval, location estimation, and the Semantic Web. MLM will be supported and maintained for three years in the first instance. A second release of the dataset is already scheduled, and the generation process outlined above is designed to enable rapid scaling.
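
Example (illustrative sketch): the six fields listed in the field table can be mirrored in a small Python record type for sanity-checking entries after download. This is not the official schema or loader. The class name MLMRecord, the language codes "en", "fr", and "de", the choice to key summaries by language code, and the (longitude, latitude) tuple order are all assumptions made for illustration; the actual on-disk format should be checked against the project documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed language codes; the description only names English, French, and German.
LANGUAGES = ("en", "fr", "de")


@dataclass
class MLMRecord:
    """Hypothetical in-memory view of one MLM entity (not the official schema)."""

    id: str
    label: str
    coordinates: Tuple[float, float]  # assumed order: (longitude, latitude)
    summaries: Dict[str, str] = field(default_factory=dict)  # assumed: language code -> text
    images: List[str] = field(default_factory=list)
    classes: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means no issue found."""
        problems = []
        lon, lat = self.coordinates
        if not -180.0 <= lon <= 180.0:
            problems.append(f"longitude out of range: {lon}")
        if not -90.0 <= lat <= 90.0:
            problems.append(f"latitude out of range: {lat}")
        unknown = sorted(set(self.summaries) - set(LANGUAGES))
        if unknown:
            problems.append(f"unexpected summary languages: {unknown}")
        # The description states that each entity has between one and three language entries.
        if not 1 <= len(self.summaries) <= len(LANGUAGES):
            problems.append("expected between one and three summaries")
        return problems


# Placeholder values for illustration only; they are not taken from the dataset.
example = MLMRecord(
    id="example-id",
    label="Example Town",
    coordinates=(7.1, 50.7),
    summaries={"en": "An example summary.", "de": "Eine Beispielzusammenfassung."},
    images=["example.jpg"],
    classes=["ExampleClass"],
)
print(example.validate())  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue found in a record at once when scanning a large file.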

Related Organizations

Leibniz Association
Germany
University of Bonn
Germany
Jožef Stefan International Postgraduate School
Slovenia
German National Library of Science and Technology
Germany

Keywords

Machine Learning, Multitask learning, Multimodal data, Multilingual data
