
This dataset has been used to fine-tune and evaluate a DAN (Document Attention Network) model that performs information extraction from 19th-century land registry documents (initial registers, called "états de sections" in French). Training, evaluation and test subsets have already been created. The images were digitized by the French Archives of the Val-de-Marne department. They are grouped by town, which means that images from one town (that is, from the same register) cannot appear in more than one subset.

Images description

Additional documentation to come.

Columns (entities in atr-DAN)

ancien_numero_parcelle : former plot number (present in only one of the three table types)
ancienne_nature : former plot nature (present in only one of the three table types)
identite : taxpayer identity
lieu-dit : plot address
nature : plot nature
numero_parcelle : plot number
numero_proprietaire : taxpayer ID in the next register

Additional tokens

The text includes special tokens that represent additional layout information:

→ : line break (the text continues on a new line)
↑TEXT↓ : superscript text
×TEXT± : crossed-out text (most of the time, this marks outdated or erroneous information)

Notes

This dataset has been formatted using the atr-dan Python library. This means that you can skip the dataset generation step if you use the scripts available in the GitHub repository of the TPDL paper.
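
The layout tokens described above can be converted to plain text with a few regular expressions. The following is a minimal sketch, not part of the atr-dan library: the function name `to_plain_text`, the `keep_crossed_out` parameter and the sample string are hypothetical, and the sketch assumes that token pairs are well balanced and not nested.

```python
import re

# Layout tokens as listed in this description.
LINE_BREAK = "→"
SUPERSCRIPT = re.compile(r"↑(.*?)↓")   # ↑TEXT↓ : superscript
CROSSED_OUT = re.compile(r"×(.*?)±")   # ×TEXT± : crossed-out text


def to_plain_text(text, keep_crossed_out=False):
    """Hypothetical helper: turn a tokenized transcription into plain text.

    Assumes balanced, non-nested token pairs (an assumption, not a
    guarantee made by the dataset description).
    """
    if keep_crossed_out:
        # Keep the crossed-out words, drop only the markers.
        text = CROSSED_OUT.sub(r"\1", text)
    else:
        # Drop crossed-out spans entirely (often outdated information).
        text = CROSSED_OUT.sub("", text)
    # Flatten superscripts into the surrounding text.
    text = SUPERSCRIPT.sub(r"\1", text)
    # Split on the line-break token and normalize whitespace per line.
    lines = [" ".join(line.split()) for line in text.split(LINE_BREAK)]
    return "\n".join(lines)


# Made-up sample string, for illustration only.
sample = "Jean ×Dupont± Durand→M↑lle↓ Martin"
print(to_plain_text(sample))
```

Dropping crossed-out spans by default reflects the note that they usually mark outdated or erroneous information; pass `keep_crossed_out=True` to keep the words when that history matters.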
