Powered by OpenAIRE graph
ZENODO
Dataset . 2018
License: CC BY SA
Data sources: Datacite
Data sources: ZENODO

Tesseract OCR models for the Alsatian dialects

Authors: Bernhard, Delphine;

Abstract

This dataset provides trained Tesseract (https://github.com/tesseract-ocr/tesseract) OCR models for the Alsatian dialects. These models were developed in the context of the RESTAURE project, funded by the French ANR. Two models are provided:

The first model, ISKO_2015, was presented in the following article: https://hal.archives-ouvertes.fr/hal-01252241. It was trained using the jTessBoxEditor tool (http://vietocr.sourceforge.net/training.html), version 1.4 (2 May 2015), on images automatically generated from the training texts (excerpts from 7 different printed works, totalling about 9,000 words). The images were generated at a 36pt font size in two fonts (Arial and Times New Roman), each in its normal and italic variants. The resulting model (gsw.traineddata) can be used with Tesseract 3.0x.

The second model, 2018, was trained for Tesseract 4.0x, using jTessBoxEditor version 2.0.1 (28 July 2018). Again, images were automatically generated from the training text. This training text differs from the one used for the ISKO_2015 model and is "artificial", in the sense that it was built by appending word n-grams extracted from a large variety of published texts in Alsatian, spanning two centuries and several text genres. The corresponding images were generated with the Tesseract text2image tool, using the following parameters: --ptsize=36 --leading=20. The fonts used are listed in the gsw.font_properties file.

Dictionary data was also used for training.
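The image generation step for the 2018 model could be scripted roughly as follows. This is only a sketch, not the author's actual procedure: the --ptsize and --leading values come from the description above, while the file names, the font name, and the fonts directory are hypothetical placeholders (the real font list is in gsw.font_properties).

```python
import subprocess


def build_text2image_command(text_file, output_base, font,
                             fonts_dir="/usr/share/fonts"):
    """Build the text2image argument list for a single font.

    text_file, output_base, font and fonts_dir are illustrative
    placeholders; only --ptsize and --leading are taken from the
    dataset description.
    """
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        "--ptsize=36",   # value stated in the dataset description
        "--leading=20",  # value stated in the dataset description
    ]


def generate_images(text_file, output_base, font,
                    fonts_dir="/usr/share/fonts"):
    """Run text2image; requires the Tesseract training tools on PATH."""
    cmd = build_text2image_command(text_file, output_base, font, fonts_dir)
    subprocess.run(cmd, check=True)
```

Calling generate_images("gsw.training_text", "gsw.SomeFont.exp0", "Some Font") once per font would produce one image and box file pair for each font, assuming the named font is installed in the given fonts directory.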
For the dictionary data, we conflated Alsatian words found in several lexicons and corpora:

  • Lexicons produced by the OLCA (Office pour la Langue et les Cultures d'Alsace et de Moselle): http://www.olcalsace.org/fr/lexiques
  • Lexicon from a Wiktionary user page: https://fr.wiktionary.org/wiki/Utilisateur:Laurent_Bouvier/alsacien-fran%C3%A7ais
  • Lexicon from the ACPA association: http://web.archive.org/web/20160302234127/http:/culture.alsace.pagesperso-orange.fr/dictionnaire_alsacien.htm
  • Chronicles published by Raymond Matzen in the local newspaper "Les Dernières Nouvelles d'Alsace"
  • Transcriptions of television shows found in Erhart, P. (2012). Les dialectes dans les médias: quelle image de l’Alsace véhiculent-ils dans les émissions de la télévision régionale?, Université de Strasbourg, http://www.theses.fr/167563386
  • French-Alsatian parallel corpus provided by the OLCA
  • Excerpts from Adolf, P. (2006). Dictionnaire comparatif multilingue: français-allemand-alsacien-anglais. Strasbourg, France: Midgard, 373 p.

The Tesseract models can be used, for instance, with the gImageReader tool (https://github.com/manisandro/gImageReader), which provides a graphical user interface for Tesseract.

When evaluated against the same test corpus (prose by Marie Hart, theater and poetry by Gustave Stoskopf and prose by Charles Zumstein, totalling about 4,900 words), both models achieve roughly the same performance levels. Even better performance can usually be achieved by combining the Alsatian-specific model with the French and German models available for Tesseract (https://github.com/tesseract-ocr/tessdata).
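On the command line, combining models amounts to passing several language codes joined by "+" to the tesseract -l option. The sketch below assumes that gsw.traineddata from this dataset and the fra and deu models from the tessdata repository sit in the same tessdata directory; the image path, output base name, and directory are hypothetical placeholders.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base,
                            languages=("gsw", "fra", "deu"),
                            tessdata_dir=None):
    """Build a tesseract CLI call that combines several models.

    "gsw" is the Alsatian model from this dataset; "fra" and "deu"
    are the standard French and German models.
    """
    cmd = ["tesseract", str(image_path), str(output_base),
           "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Directory holding the *.traineddata files (placeholder path).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image_path, output_base, **kwargs):
    """Run tesseract and return the recognised text.

    Requires the tesseract executable on PATH; by default the CLI
    writes its result to <output_base>.txt.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, **kwargs),
                   check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Passing languages=("gsw",) runs the Alsatian model alone, which makes it easy to compare both configurations on the same page.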

Keywords

OCR, Tesseract, Alsatian
