
handle: 10138/631482
The construction of annotated geoparsing corpora is costly and has resulted in limited linguistic and geographic coverage, particularly outside English-speaking regions. This paper investigates the use of Large Language Models (LLMs) to generate synthetic geoparsing corpora for four languages in four regions: Austria (German), Belarus (Belarusian), Finland (Finnish), and Ghana (English). We evaluate the generated corpora through automatic and human quality checks, and by benchmarking state-of-the-art geoparsers on the synthetic data. Our results show that while LLM-generated corpora enable credible geoparser evaluation, low-resource regions and languages expose systematic limitations in LLM-powered synthetic text generation approaches related to underlying geographic data coverage, completeness, and linguistic variation. We make our code available on GitHub.
Peer reviewed
large language model, toponym recognition, Geosciences, geoparsing
