ZENODO
Other literature type . 2025
License: CC BY
Data sources: ZENODO

Leveraging AI for Enhanced Archaeological Data Extraction: Workflows for Textual and Image-Based Data

Authors: Pajdla, Petr; Novák, David; Harasim, Ronald; Křivánková, Dana; Straňák, Pavel; Lutsai, Kateryna; Lečbychová, Olga


Abstract

The digitization of archaeological archives, particularly grey literature and archival photographs, holds immense potential for knowledge discovery. However, manual processing of such data is labour-intensive and often inconsistent, making it a prime candidate for automation. This paper presents a pilot implementation of reusable digital research workflows that integrate text and image recognition technologies with AI models to streamline the analysis of archaeological documentation. These workflows are being developed to enhance (meta)data quality in the Archaeological Map of the Czech Republic (AMCR) digital repository and in the ARIADNE Knowledge Base and discovery service.

Our approach to textual data leverages OCR/HTR and NLP tools to process archival reports, generating machine-readable text from a combination of manuscripts, typescripts, and printed materials. Using AI-driven information extraction techniques, we prepare models for automated segmentation and OCR/HTR processing of documents, implemented through the e-Scriptorium service and a newly developed dashboard. Based on the recognition outputs, LINDAT/CLARIAH-CZ open-source tools are applied for enhanced full-text search (tokenization, tagging, lemmatization, etc.; UDPipe), keyword identification (KER), and named entity recognition (personal and place names, temporal data, AMCR vocabulary terms, identifiers, etc.; NameTag). The goal is an integrated solution that enables processing of both legacy data and new uploads to the AMCR system and offers users more efficient services for searching and processing documents. A secondary objective is to simplify archival procedures by automating some of the steps involved in describing and archiving documents.

In parallel, we implement an object recognition workflow for the detection and classification of archaeological objects, i.e. artefacts and other objects of interest, in archival photographs. By adapting and fine-tuning deep learning models (e.g., ResNet) for archaeology, we segment and annotate archival photographs according to AMCR controlled vocabularies. Two types of image datasets are used: first, images of single finds, often photographed on standardised backgrounds with scales; and second, images with varied content, including fieldwork photographs of trenches, burials, etc. Mapping the vocabularies used across the datasets to Getty AAT terms ensures interoperability within the ARIADNE infrastructure. This workflow streamlines the annotation of archival photographs with terms from domain-specific controlled vocabularies and allows identification of archaeological artefacts and other objects of interest. This simplifies the otherwise time-consuming task of creating metadata and, at the same time, opens new doors for connecting and cross-referencing image data with textual data, e.g. grey-literature find reports.

The talk summarises the journey towards implementing both workflows and discusses what has worked so far and what has not, including the dead ends we encountered and what we learned along the way. The current state of the workflows' implementation is demonstrated on pilot results based on archival textual and image documents, showcasing how AI technologies can enhance the processing of archaeological archives and foster further research.

Presentation from a talk given at the CAA2025 Digital Horizons conference in session 19, "Reusable Digital Research Workflows for Archaeology".

Keywords

Archaeology
