ZENODO
Dataset . 2020
License: CC BY
Data sources: Datacite
ZENODO
Dataset . 2024
License: CC BY
Data sources: Datacite

6000 ground truth of VOC and notarial deeds 3.000.000 HTR of VOC, WIC and notarial deeds

Authors: Keijser, Liesbeth;

Abstract

The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to semi-automatically transcribe 2 million pages of old Dutch texts. The transcribed archives are 17th- and 18th-century documents of the Dutch East India Company (VOC) and 19th-century notarial deeds from Noord-Hollands Archief and other archives in the provinces.

To train the HTR software, a team produced transcriptions of approximately 6000 scans, selected at random from the dataset. With these transcriptions a model was trained that recognizes more than 90% of the characters correctly. Transkribus then transcribed the 2 million scans automatically using the trained model.

The Transkribus HTR+ model trained for the text recognition is called "IJsberg". More information about the model is given in the chapter "Dutch Handwriting" of the model documentation linked from the original record. The Transkribus team later retrained the model with PyLaia technology, which improved on the HTR+ model. This PyLaia model is not publicly available.

Later on, 1 million additional scans concerning the West India Company (WIC) were transcribed automatically, without adding extra ground truth or training. These archives date from the 17th and 18th centuries.

The datasets published in Zenodo contain the ground truth (scans in JPG, transcriptions in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview below. The files themselves can be downloaded from the Zenodo record. Further information on how the Dutch National Archives innovates on digital accessibility, and open data access to the scans and inventories of the National Archives, are linked from the original record.

Disclaimer: due to the variety of languages used and the bad state of the documents, the HTR results of "1.05.21, Dutch series Guyana" can be of poor quality.
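The statement that the model recognizes more than 90% of the characters correctly corresponds roughly to a character error rate (CER) below 10%. The following is a minimal sketch of how such a figure can be computed from a ground-truth line and an HTR output line. It assumes CER is defined as the Levenshtein edit distance divided by the length of the reference; the record does not state how the project measured accuracy, and the function names and sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example lines, not taken from the dataset.
ground_truth = "Compagnie"
htr_output = "Compagne"
cer = character_error_rate(ground_truth, htr_output)
print(f"CER: {cer:.3f}, character accuracy: {1 - cer:.3f}")
```

Under this definition, one missing character in a nine-character reference gives a CER of 1/9, that is, a character accuracy of about 89%.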
--------------------------------------------------------------
HTR datasets (dataset, name of archive, archive number, inventory numbers, link to inventory)

HTR results VOC, VOC, 1.04.02, 7527-9540, EAD
HTR results 1.04.02, Oost-Indische Testamenten, 1.04.02, 6847-6897, EAD
HTR results 1.05.01.01, Oude WIC, 1.05.01.01, 1-87, EAD
HTR results 1.05.01.02, Tweede WIC, 1.05.01.02, 1-1382, EAD
HTR results 1.05.02, Raad der Koloniën, 1.05.02, 1-192, EAD
HTR results 1.05.03, Sociëteit van Suriname, 1.05.03, 1-566, EAD
HTR results 1.05.05, Sociëteit van Berbice, 1.05.05, 1-445, EAD
HTR results 1.05.06, Verspreide West-Indische stukken, 1.05.06, 1-1413, EAD
HTR results 1.05.21, Dutch series Guyana, 1.05.21, AB.1.1-BB.7.1, EAD
HTR results 2.01.28.01, West-Indisch comité, 2.01.28.01, 1-254, EAD
HTR results 2.01.28.02, Raad der Amerikaanse Bezittingen, 2.01.28.02, 1-264, EAD
HTR results NHA Notarial 1617, Oud notarieel archief Haarlem, 1617, 5-813, EAD
HTR results NHA Notarial 1972, Nieuw notarieel archief Haarlem, 1972, 1593-1805, EAD

Ground truth datasets (name of archive, archive number, inventory numbers, link to inventory, type of dataset)

Dataset: Notarial deeds Ground Truths of the trainingset
Oud notarieel archief Haarlem, 1617, 495 random scans from 5-813, EAD, GT Transcriptions
Nieuw notarieel archief Haarlem, 1972, 952 random scans from 1593-1805, EAD, GT Transcriptions
(And 168 transcriptions from 7 other archives.)

Dataset: Notarial deeds Images of the trainingset
Nieuw notarieel archief Haarlem, 1972, 952 random scans from 1593-1805, EAD, GT Scans
Oud notarieel archief Haarlem, 1617, 495 random scans from 5-813, EAD, GT Scans
(And 168 scans from 7 other archives.)

Dataset: VOC Ground Truths of the trainingset
VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Transcriptions

Dataset: VOC Images of the trainingset
VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Scans
--------------------------------------------------------------
Version 3.0: The first HTR results from the VOC collection are available in .txt format (inventory numbers 7527-9540).
Version 3.1: The HTR results from the VOC collection are also available in PAGE XML format.
Version 4.0: About 30 missing inventory numbers have been added to the VOC transcriptions. The HTR results of the notarial deeds from the NHA archives have been added. An example of research based on full-text search can be found here (in Dutch): https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining
Version 5.0: Around a million pages of HTR results of the following archives have been added.
Version 6.0: The HTR results of Oost-Indische Testamenten have been added.
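Because both the ground truth and the HTR results are distributed as PAGE XML, a short sketch of reading the transcribed lines from such a file may be useful. It relies on the general PAGE layout, in which each `TextLine` element carries its text in a `TextEquiv`/`Unicode` child. The sample document, its namespace version, and the function names are assumptions for illustration; the files in this dataset may use a different schema version or contain additional elements.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def page_xml_lines(xml_text: str) -> list[str]:
    """Return the text of every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Use only the line's own TextEquiv, not those of nested Word elements.
        for child in elem:
            if local_name(child.tag) == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        lines.append(sub.text or "")
                break
    return lines


# Hypothetical minimal PAGE document, not taken from the dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>eerste regel</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>tweede regel</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>eerste regel
tweede regel</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

for line in page_xml_lines(SAMPLE):
    print(line)
```

Matching on local tag names rather than a fixed namespace is a deliberate choice here, since PAGE XML files produced by different tool versions declare different schema dates.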

Keywords

Nationaal Archief, Transkribus, Transcriptions, Verenigde Oost-Indische Compagnie, Noord-Hollands Archief, West-Indische Compagnie, Notarial deeds
