Multilingual Synthetic Corpora for Geoparsing Using Large Language Models

Annotated geoparsing corpora are costly to construct, which has limited their linguistic and geographic coverage, particularly outside English-speaking regions. This paper investigates the use of Large Language Models (LLMs) to generate synthetic geoparsing corpora for four languages in four regions: Austria (German), Belarus (Belarusian), Finland (Finnish), and Ghana (English). We evaluate the generated corpora through automatic and human quality checks, and by benchmarking state-of-the-art geoparsers on the synthetic data. Our results show that while LLM-generated corpora enable credible geoparser evaluation, low-resource regions and languages expose systematic limitations of LLM-based synthetic text generation, related to the coverage and completeness of the underlying geographic data and to linguistic variation. We make our code available on GitHub.

Peer reviewed

Country

Finland

Related Organizations

University of Helsinki
Finland

Keywords

large language model, toponym recognition, Geosciences, geoparsing

Impact by BIP!

	selected citations	0
	popularity	Average
	influence	Average
	impulse	Average

Related to Research communities

UArctic