Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires

Publication: Article, Presentation, Other literature type, Conference object. 28 Mar 2022. United Kingdom. Publisher: University of Alberta Libraries. Journal: IASSIST Quarterly, volume 46 (issn: 0739-1137, eissn: 2331-4141)

Authors: Suparna De; Harry Moss; Jon Johnson; Jenny Li; Haeron Pereira; Sanaz Jabbari

doi: 10.29173/iq1023 , 10.5281/zenodo.5742916 , 10.5281/zenodo.5742915

Abstract

Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow. Through its support for versioning and provenance, including objects such as BasedOn, it also encouraged the reuse of existing question items. However, a dearth of questionnaire banks that include both question text and response domains has limited the ecosystem needed to support the development of DDI-ready Computer Assisted Interviewing (CAI) tools. Archives hold this information in the PDFs associated with surveys, but extracting it efficiently into DDI-Lifecycle is a significant challenge. While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires held as PDFs. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches have been explored to classify parsed questionnaire text so that it can be output as valid DDI items for inclusion in a DDI-L compliant repository. The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics’ values, which enables model tuning as well as transparent management of data and experiments.
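The idea of keeping a map from parameter combinations to evaluation metric values can be illustrated with a minimal sketch. It is not the authors' pipeline: the labelled lines, the cue-word list, the rule-based `classify` function (a toy stand-in for a trained model), and the parameter names `min_tokens` and `use_cues` are all hypothetical, chosen only to show how each configuration might be recorded against its score.

```python
from itertools import product

# Hypothetical labelled lines standing in for parsed questionnaire text.
LINES = [
    ("How old are you?", "question"),
    ("Do you own or rent your home?", "question"),
    ("Section B: Employment", "other"),
    ("Yes", "other"),
    ("Please describe your current job", "question"),
    ("Page 3 of 12", "other"),
]

# Hypothetical cue words that often open a survey question.
CUE_WORDS = {"how", "do", "what", "please", "which"}


def classify(text, min_tokens, use_cues):
    """Toy rule-based stand-in for an ML model: label one line of text."""
    tokens = text.lower().split()
    if len(tokens) < min_tokens:
        return "other"
    if text.rstrip().endswith("?"):
        return "question"
    if use_cues and tokens[0] in CUE_WORDS:
        return "question"
    return "other"


def accuracy(min_tokens, use_cues):
    """Fraction of labelled lines that the toy classifier gets right."""
    hits = sum(classify(t, min_tokens, use_cues) == y for t, y in LINES)
    return hits / len(LINES)


def run_grid():
    """Record every parameter combination against its metric value."""
    grid = {"min_tokens": [2, 5], "use_cues": [False, True]}
    results = {}
    for min_tokens, use_cues in product(grid["min_tokens"], grid["use_cues"]):
        results[(min_tokens, use_cues)] = accuracy(min_tokens, use_cues)
    return results


if __name__ == "__main__":
    # Print configurations from best to worst accuracy.
    for params, score in sorted(run_grid().items(), key=lambda kv: -kv[1]):
        print(params, round(score, 3))
```

Because every configuration is stored alongside its score, a run can be compared with any other and repeated later, which is the kind of reproducibility and comparative analysis the abstract describes.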

Country

United Kingdom

Keywords

automated metadata extraction, H, machine learning, model provenance, hyperparameter tuning, DDI Lifecycle, Social Sciences, longitudinal surveys

Impact by BIP!

- selected citations: 5. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.