ZENODO
Dataset . 2021
License: CC BY
Data sources: Datacite
Data sources: ZENODO

Dataset for Logical-layout analysis on French historical newspapers

Authors: Nicolas Gutehrlé; Iana Atanassova;


Abstract

This is a dataset for training and testing logical-layout analysis and recognition systems on French historical documents published between 1900 and 1950. The original data is part of the "Fond régional: Franche Comté", which is curated by Gallica, the digital portal of the Bibliothèque nationale de France (BnF).

This dataset is divided into a train set and a test set. Both have been designed to cover as many as possible of the layouts that exist in the "Fond régional: Franche Comté" dataset. To do so, we have divided them into three layout types:

* 1c: documents where the text is displayed in one column, as in books;
* 2c: documents where the text is displayed in two columns;
* 3c+: documents where there are at least three columns of text, as in newspapers.

Each of these folders contains subfolders prefixed by ‘cb’. Each such name is the identifier of a newspaper collection such as « Le Semeur ». Each folder also contains an XML file describing the collection, which is not needed for logical-layout analysis. The folders also contain subfolders prefixed by ‘bpt’, with the following files:

* XXX.xml: the original XML file as gathered from Gallica;
* truelabels_block: a CSV file giving the true label of each TextBlock tag. Each line contains the page, the block_id, the first and the last line of text of the block, and its label;
* truelabels_line: a CSV file giving the true label of each TextLine tag. Each line contains the page, the line_id, the text of the line, and its label;
* XXX_docbook.xml: the document as processed by a logical-layout recognition system.
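As an illustration of the block-level label file described above, here is a minimal Python sketch that reads such a CSV and counts the labels. It assumes comma-separated values, no header row, and the column order given in the description; none of this is confirmed by the dataset documentation, so the delimiter and column list may need adjusting. The function name `read_block_labels` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed column order, taken from the file description above. Whether the
# real files have a header row or use another delimiter is not stated.
BLOCK_COLUMNS = ["page", "block_id", "first_line", "last_line", "label"]


def read_block_labels(handle, delimiter=","):
    """Return one dict per well-formed row of a truelabels_block file."""
    rows = []
    for record in csv.reader(handle, delimiter=delimiter):
        if len(record) != len(BLOCK_COLUMNS):
            continue  # skip blank or malformed rows
        rows.append(dict(zip(BLOCK_COLUMNS, record)))
    return rows


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE = io.StringIO(
    "1,block_1,LE SEMEUR,LE SEMEUR,Header\n"
    "1,block_2,Chronique locale,Chronique locale,Title\n"
    "1,block_3,Premiere ligne,Derniere ligne,Text\n"
)

rows = read_block_labels(SAMPLE)
label_counts = Counter(row["label"] for row in rows)
print(label_counts)
```

To read a real file, pass an open file handle instead of `SAMPLE`, for example `open(path, newline="", encoding="utf-8")`. The encoding is also an assumption.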
The XXX.xml file, which is the original file as stored on Gallica, provides several kinds of information on the document, such as:

* metadata, which follows the Dublin Core format;
* pagination;
* OCR, which follows the ALTO XML format.

The OCR output for the whole document is available in a PrintSpace tag. Lines of text are contained in TextLine tags, which in turn contain String tags for words and SP tags for spaces. TextLine tags are grouped into blocks in TextBlock tags.

The truelabel_block.csv file indicates the true logical label of each TextBlock tag in the document. The possible labels are Text, Title, Header and Other. Similarly, the truelabel_lines.csv file indicates the true logical label of each TextLine tag in the document. The possible labels are Text, Firstline (the first line of a paragraph), Title, Header and Other. Each line in these files contains an id, for a block or a line of text respectively, which is found in the OCR section of the XXX.xml file.

The XXX_docbook.xml file was produced by a rule-based logical-layout recognition system. It contains the text from the OCR section of the XXX.xml file, surrounded by tags that correspond to logical labels. The possible tags are Header, Title, Para (for paragraphs), Sent (for sentences) and Other. Because these files were generated automatically, their labels may differ from the ones given in the CSV files.

You can access the original scan of every document on the Gallica website. To do so, use the following URL, replacing <IDENTIFIER> with the id of the document (e.g. bpt6k76208717): https://gallica.bnf.fr/ark:/12148/<IDENTIFIER>
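The tag hierarchy and the URL pattern described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it matches tags by local name because the XML namespace used in the Gallica files is not specified here, it relies on the standard ALTO attributes `ID` and `CONTENT`, and the helper names (`iter_lines`, `gallica_url`) and the sample XML are hypothetical rather than part of the dataset.

```python
import xml.etree.ElementTree as ET

GALLICA_ARK_PREFIX = "https://gallica.bnf.fr/ark:/12148/"


def local_name(tag):
    """Strip a '{namespace}' prefix, since the namespace may vary by file."""
    return tag.rsplit("}", 1)[-1]


def iter_lines(root):
    """Yield (block_id, line_id, text) for each TextLine inside a TextBlock."""
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            parts = []
            for child in line:
                name = local_name(child.tag)
                if name == "String":
                    # In ALTO, the word itself is stored in CONTENT.
                    parts.append(child.get("CONTENT", ""))
                elif name == "SP":
                    parts.append(" ")
            yield block.get("ID"), line.get("ID"), "".join(parts)


def gallica_url(identifier):
    """Build the URL of the original scan from a document id."""
    return GALLICA_ARK_PREFIX + identifier


# Made-up fragment for illustration; real files also carry metadata and
# pagination, and their ids may follow a different pattern.
SAMPLE = """
<alto>
  <Layout><Page><PrintSpace>
    <TextBlock ID="block_1">
      <TextLine ID="line_1">
        <String CONTENT="LE"/><SP/><String CONTENT="SEMEUR"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>
"""

lines = list(iter_lines(ET.fromstring(SAMPLE.strip())))
print(lines)
print(gallica_url("bpt6k76208717"))
```

The block and line ids yielded here are the ones that the CSV label files refer to, so the two sources can be joined on those ids.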

Keywords

Historical newspapers, Logical layout, Natural Language Processing

Related to Research communities
STARS EU