
The DARE Database is a collection of handwritten dates derived from different historical sources from Sweden and Denmark. Additional details are available on our GitHub and on arXiv.

There are seven splits provided in this dataset, representing the different data sources. Each folder contains the respective minipics and their labels, split into test and training files. The numbers of files and tokens are:

- Train images: 2,876,752
- Test images: 152,414
- Total number of images: 3,029,166
- Total number of tokens: 9,682,027

These are broken down by source in the following table:

| Datasets | Sequence | Training Observations | Test Observations |
| --- | --- | --- | --- |
| Death Certificates (1) | DD-MM-YYYY | 11,627 | 1,000 |
| Death Certificates (2) | DD-MM-YYYY | 155,439 | 8,338 |
| Police Records (1) | DD-MM-YY | 1,006,199 | 53,488 |
| Police Records (2) | DD-MM-YY | 326,478 | 17,103 |
| Swedish Records Birth Dates | DD-MM-YY | 597,756 | 31,389 |
| Swedish Records Death Dates | DD-MM | 547,813 | 28,803 |
| Funeral Records | DD-MM | 231,440 | 12,293 |

Note that for data restriction reasons, the CIHVR images are excluded (as we do not have permission to publicly share those).

The images consist purely of digits, with one exception: the month in the date sequences is sometimes written with alphabetic characters, e.g., "February" or "Feb".

The original images were acquired from Copenhagen Archives, the National Archives of Denmark, and Lund University. The minipics are created using Coherent Point Drift to extract the regions of interest from the source documents.

Note that many of the Swedish cause of death records are labelled as either empty or partly empty. A partly empty label, e.g., ' 29-" ', indicates that the month cell is in fact not empty; rather, the month is the same as in the row above. Many historical tabulated records use a special mark to denote "same as above".
The other cells, labelled ' ,-,-, ' for birth dates or ' ,-, ' for death dates, are completely empty and could be excluded for pure digit recognition models. However, when transcribing historical records, empty cells occur frequently and should be taken into account one way or another.
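
As an illustration of the label conventions described above, the following is a minimal Python sketch that sorts labels into empty, partly empty, and filled, and optionally carries a "same as above" field forward from the previous row. It is not part of the dataset's own tooling. The function names `classify_label` and `resolve_ditto` are hypothetical, and the sketch assumes that labels are plain strings, that surrounding whitespace can be ignored, that `-` separates the date fields, and that `"` is the only "same as above" mark.

```python
DITTO_MARK = '"'    # assumed "same as above" mark, as in the example label 29-"
SEPARATORS = "-,"   # assumed separator/placeholder characters in empty labels


def classify_label(label: str) -> str:
    """Return 'empty', 'partly_empty', or 'filled' for a raw label string."""
    core = label.strip()
    # Only separators and whitespace: the cell was completely empty.
    if not core.strip(SEPARATORS + " "):
        return "empty"
    # A ditto mark means the field repeats the value from the row above.
    if DITTO_MARK in core:
        return "partly_empty"
    return "filled"


def resolve_ditto(labels):
    """Replace ditto fields with the same field from the last non-empty row.

    Assumption: a ditto mark refers to the most recent non-empty row,
    and rows are given in their original top-to-bottom order.
    """
    resolved = []
    previous = None
    for label in labels:
        core = label.strip()
        if classify_label(core) == "empty":
            # Keep empty cells as they are; they do not update the reference row.
            resolved.append(core)
            continue
        fields = core.split("-")
        if previous is not None and len(fields) == len(previous):
            fields = [
                prev if field.strip() == DITTO_MARK else field
                for field, prev in zip(fields, previous)
            ]
        resolved.append("-".join(fields))
        previous = fields
    return resolved
```

A pure digit recognition model could drop every label for which `classify_label` returns `'empty'`, whereas a transcription pipeline would more likely keep all three classes and treat the empty label as a valid target.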
