Text Restoration of Historical Documents

Author: Shibingfeng Zhang

Abstract

This PhD project investigates the application of pre-trained language models (PLMs) to the automated restoration of Latin diplomatic texts, with a focus on medieval notarial documents. The project addresses a significant challenge in historical document studies: the reconstruction of damaged or missing text in low-resource Latin corpora. To this end, it systematically evaluates a range of PLMs that vary in architecture, training language, and scale, in order to identify the most effective approach for this specialised restoration task. The project is structured around two research questions: (1) Does adding Ancient Greek and English during pre-training improve performance on Latin text restoration, or is monolingual pre-training exclusively on Latin more effective? (2) How do smaller, domain-specific models fine-tuned on Latin compare with large commercial language models used with few-shot prompting for Latin text restoration? The experimental design distinguishes two settings according to whether the length of the missing text is known or unknown, which motivates the evaluation of both encoder-based models and encoder-decoder or decoder-only models. Controlled comparisons between model pairs that share an identical architecture but differ in training data allow a rigorous assessment of the effect of multilingual pre-training on downstream Latin text restoration.
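The known-length setting maps naturally onto masked-language modelling with an encoder: when the lacuna is known to span a fixed number of tokens, each missing token can be replaced by a mask and scored directly. The sketch below is illustrative only, not the project's actual pipeline: bert-base-multilingual-cased stands in for the Latin and multilingual encoders under study, and the Latin sentence is an invented example, not drawn from the project's corpus.

```python
from transformers import pipeline

# Placeholder checkpoint; the project compares monolingual Latin
# encoders against multilingual ones (Latin + Ancient Greek + English).
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Known-length setting: the damaged span covers exactly one token,
# so a single mask token stands in for it. (Invented example sentence.)
damaged = f"In nomine {fill_mask.tokenizer.mask_token} amen."

# Rank candidate restorations by model probability.
for candidate in fill_mask(damaged, top_k=5):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}")
```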
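The unknown-length setting instead calls for a model that can emit a completion of unspecified length, which is where encoder-decoder and decoder-only models come in. A minimal few-shot sketch follows, again under stated assumptions: gpt2 is merely a runnable placeholder for the fine-tuned domain-specific models and commercial LLMs the project actually compares, and the exemplar damaged/restored pairs are invented for illustration.

```python
from transformers import pipeline

# Placeholder decoder-only checkpoint for the unknown-length setting.
generator = pipeline("text-generation", model="gpt2")

# Few-shot prompt: each exemplar pairs a damaged passage (lacuna marked
# with [...]) with its restoration. Examples are invented, not corpus data.
few_shot_prompt = (
    "Restore the missing Latin text marked by [...].\n\n"
    "Damaged: In nomine [...] amen.\n"
    "Restored: In nomine domini amen.\n\n"
    "Damaged: Ego [...] notarius subscripsi.\n"
    "Restored: "
)

# Greedy decoding; the model is free to produce a span of any length.
out = generator(few_shot_prompt, max_new_tokens=10, do_sample=False)
print(out[0]["generated_text"])
```

Because generation is open-ended, the number of missing tokens never has to be specified in advance, which is what makes decoder-style models the natural choice when the extent of the lacuna is unknown.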
