CM1 Dataset

This is the CM1-Dataset designed for the evaluation of information extraction from historical documents with Large Vision Language Models.

Paper: https://arxiv.org/abs/2505.04214
GitHub: https://github.com/fabiwo6/cm1

Abstract

The automatic extraction of key-value information from handwritten documents is a key challenge in document analysis. A reliable extraction is a prerequisite for the mass digitization efforts of many archives. Large Vision Language Models (LVLM) are a promising technology to tackle this problem especially in scenarios where little annotated training data is available. In this work, we present a novel dataset specifically designed to evaluate the few-shot capabilities of LVLMs. The CM1 documents are a historic collection of forms with handwritten entries created in Europe to administer the Care and Maintenance program after World War Two. The dataset establishes three benchmarks on extracting name and birthdate information and, furthermore, considers different training set sizes. We provide baseline results for two different LVLMs and compare performances to an established full-page extraction model. While the traditional full-page model achieves highly competitive performances, our experiments show that when only a few training samples are available the considered LVLMs benefit from their size and heavy pretraining and outperform the classical approach.

Annotations

cm1_cover_*.json:

"document_id": [{"Name": "last_name_person_1", "Vorname": "first_name_person_1", "Geb-Dat": "birth_date_person_1"}, {"Name": "last_name_person_2", "Vorname": "first_name_person_2", "Geb-Dat": "birth_date_person_2"}],

cm1_namedate_*.txt:

cluster_id/document_id.jpg first_name last_name birth_date
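The two annotation layouts described above could be read along the following lines. This is a minimal sketch, not code from the dataset's repository. The function names `parse_cover_annotations` and `parse_namedate_line` are hypothetical. It assumes that each cm1_cover_*.json file is a single JSON object keyed by document ID, and that each line of a cm1_namedate_*.txt file holds exactly four whitespace-separated fields. Neither assumption is confirmed by the description, so names containing spaces would need different handling.

```python
import json


def parse_cover_annotations(text):
    """Parse the text of a cm1_cover_*.json file.

    Assumption: the file is one JSON object mapping each document ID
    to a list of person entries with the keys Name, Vorname, Geb-Dat.
    """
    raw = json.loads(text)
    result = {}
    for document_id, persons in raw.items():
        result[document_id] = [
            {
                "last_name": person.get("Name", ""),
                "first_name": person.get("Vorname", ""),
                "birth_date": person.get("Geb-Dat", ""),
            }
            for person in persons
        ]
    return result


def parse_namedate_line(line):
    """Parse one line of a cm1_namedate_*.txt file.

    Assumption: exactly four whitespace-separated fields in the order
    image path, first name, last name, birth date.
    """
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    image_path, first_name, last_name, birth_date = parts
    cluster_id, _, file_name = image_path.partition("/")
    document_id = file_name.rsplit(".", 1)[0]  # drop the .jpg extension
    return {
        "cluster_id": cluster_id,
        "document_id": document_id,
        "first_name": first_name,
        "last_name": last_name,
        "birth_date": birth_date,
    }


# Placeholder values taken from the format description, not real records.
cover_text = (
    '{"document_id": ['
    '{"Name": "last_name_person_1", "Vorname": "first_name_person_1", '
    '"Geb-Dat": "birth_date_person_1"}, '
    '{"Name": "last_name_person_2", "Vorname": "first_name_person_2", '
    '"Geb-Dat": "birth_date_person_2"}]}'
)
cover = parse_cover_annotations(cover_text)
record = parse_namedate_line(
    "cluster_id/document_id.jpg first_name last_name birth_date"
)
```

Mapping the German keys (Name, Vorname, Geb-Dat) to the field names used in the text format gives both annotation types the same record shape, which may simplify comparing predictions across the benchmarks.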

Related Organizations

Osnabrück University
Germany
TU Dortmund University
Germany
