Publication . Conference object . 2017

The Karjala database – challenges and solutions for digitizing heterogeneous, old genealogical documents for internet use

Saarti, Jarmo; Ropponen, Jari; Soivanen, Satu;
Published: 15 Aug 2017
Publisher: HAL CCSD

The Karjala database contains digitized demographic data from the parish registers of the regions ceded to the Soviet Union in 1944. The objectives of the digitization project have been to promote access to digitized records for scientific research and genealogy and to encourage research on the people of the ceded Karelia region. The main sources for the database have been catechetical lists, lists of children, and registers of vital statistics (registers of births, marriages, migrations and deaths) from the period 1681–1949, which are available in the Digital Archives of the National Archives of Finland. The database holds about 10.3 million entries, but only data older than 100 years is published openly on the Internet: according to decisions by the Finnish data protection authorities, the Personal Data Act applies to personal registers less than 100 years old. The digitization process is still under way; an estimated 1.2 million entries remain to be processed. The database is available to users online. At present, about 6.5 million entries are available on the Internet, each presenting data about one individual, e.g. names, dates of birth and death, cause of death, age, gender, marital status, occupation, residence, migration and parish. The Karjala database can be used for diverse research purposes, and it improves access to church records that are sometimes very difficult to read. Information in the database can be utilized in historical research, medical genetics, social sciences, family history and onomastics, for example to clarify family structures, migratory patterns or child mortality. The database also offers excellent opportunities for interdisciplinary research.
Our presentation will describe the management of the digitization process for old, handwritten documents consisting of non-structured data with varied linguistic material: several languages from a historical period in which nations, states and languages were still evolving, different calendars, differing spelling rules, etc. We will also introduce our plans to use text recognition technology so that handwritten material such as that in the Karjala database can be incorporated into the international READ project network. We will discuss the challenges encountered with this type of heterogeneous data and the possibilities for more defined and structured data management that could enable automated use of the database. Finally, we will describe the different phases of the database's evolution, emphasizing its linkage with internet technologies, e.g. how they have either hindered or enabled the digitization project.


genealogical documents, handwriting, Karelia, Finland, digitization, [INFO.INFO-DL]Computer Science [cs]/Digital Libraries [cs.DL]

