
Article GT guidelines for NewsEye (as of March 2020)

Article or 'news item'
- An article or news item is defined as a piece of content that can be clearly separated from other similar pieces by its content. It therefore comprises not only "articles" but also advertisements, classified advertisements and other contributions within a newspaper.
- The PAGE XML contains, for each line, the custom tag 'structure' with type 'article' and the id of the individual article. All lines with the same id belong to the same article.

There are five different types of regions: TextRegion, Graphic/ImageRegion, TableRegion, AdvertRegion/ClassifiedAdvertRegion, SeparatorRegion. In detail:
- The TextRegions are located at the text block level. The individual TextRegions do not overlap.
- If text blocks also appear inside a Graphic/ImageRegion, a TextRegion was created for them.
- For tables, two regions were created: a TableRegion and a TextRegion of the same size (within a table, no text blocks need to be marked).
- An AdvertRegion covers not only advertising but also general/classified advertisements (e.g. death notices).
- Only visible separators (both horizontal and vertical) are captured by a SeparatorRegion.

Additionally, structure tags for TextRegions are defined as paragraph, heading, caption, enumeration:
- TextRegions that mark ordinary blocks of text are tagged with the 'paragraph' tag.
- Headings and subheadings are both tagged as 'heading'. One structure tag is sufficient, so no additional tag was introduced for subheadings.
- Captions are typically found beneath images and graphics and are tagged as 'caption'.
- Enumerations: the individual text blocks are marked, and in addition a TextRegion with the structure type 'enumeration' was drawn around the entire enumeration.
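The rule that all lines with the same id belong to the same article can be illustrated with a short Python sketch that groups TextLine ids by article id. The exact syntax of the custom attribute is not spelled out in the guidelines; the form `structure {id:a1; type:article;}` used here is an assumption, as are the sample document and the helper names `parse_structure` and `lines_by_article`.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed shape of the custom attribute (not specified in the guidelines):
#   custom="readingOrder {index:0;} structure {id:a1; type:article;}"
STRUCTURE_RE = re.compile(r"structure\s*\{([^}]*)\}")

# Hypothetical, minimal PAGE-like document used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" custom="structure {type:heading;}">
      <TextLine id="l1" custom="readingOrder {index:0;} structure {id:a1; type:article;}"/>
    </TextRegion>
    <TextRegion id="r2" custom="structure {type:paragraph;}">
      <TextLine id="l2" custom="readingOrder {index:0;} structure {id:a1; type:article;}"/>
      <TextLine id="l3" custom="readingOrder {index:1;} structure {id:a2; type:article;}"/>
      <TextLine id="l4" custom="readingOrder {index:2;}"/>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_structure(custom):
    """Return the key/value pairs of the 'structure' tag, or {} if absent."""
    match = STRUCTURE_RE.search(custom or "")
    if not match:
        return {}
    pairs = {}
    for part in match.group(1).split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def lines_by_article(page_xml):
    """Map each article id to the ids of the TextLines tagged with it."""
    root = ET.fromstring(page_xml)
    articles = defaultdict(list)
    for elem in root.iter():
        # Compare local names so the code does not depend on the schema version.
        if elem.tag.rsplit("}", 1)[-1] != "TextLine":
            continue
        info = parse_structure(elem.get("custom"))
        if info.get("type") == "article" and "id" in info:
            articles[info["id"]].append(elem.get("id"))
    return dict(articles)


if __name__ == "__main__":
    for article_id, line_ids in lines_by_article(SAMPLE).items():
        print(article_id, line_ids)
```

Lines without an article structure tag (such as `l4` in the sample) are simply skipped. Matching on the local element name rather than a fixed namespace is a design choice that keeps the sketch independent of the PAGE schema version in a given file.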
The dataset comprises French newspaper pages from the 19th and early 20th century with annotated text. The page images were provided by the French National Library and comprise 184 pages (training set). The data follow the PAGE format (cf. https://github.com/PRImA-Research-Lab/PAGE-XML/) and were produced with the Transkribus platform with support from the NewsEye and READ projects. The guidelines for creating the article separation (AS) GT were added to the 'Additional notes'.
Transkribus, Article Separation