
The datasets hosted in NHLBI's BioData Catalyst® (BDC) ecosystem include records derived from surveys and clinical questionnaires. Absent common data elements, whether mandated by funders or established by research-community consensus, similar constructs (e.g., employment status, financial insecurity, or medical cost burden) are often instantiated with different variable names and/or ranges of valid response values. This representational heterogeneity poses a significant challenge along multiple dimensions of data FAIRness (Findability, Accessibility, Interoperability, and Reusability). In the BDC context, it can inhibit discoverability, data harmonization, and the creation of dataset- and study-spanning cohorts. To address these issues, we developed a scalable approach to identifying {questionnaire item} + {response set} pairs that assess similar constructs but are instantiated differently. Leveraging a form of data representation known as an "embedding," we represent the semantics of these pairs in a way that permits computation of similarity scores. We can then automatically categorize highly similar and highly divergent pairs and flag ambiguously similar pairs for human review. This approach can drastically reduce the labor required to execute such mapping exercises. This poster presents the results of a proof-of-concept application of the method using variables from the Gravity Project, data dictionaries from four BDC datasets, and a mapping schema based upon Simple Knowledge Organization System (SKOS) relations. Performance varied by domain, with employment status scoring best (F1-score = 1.0) and financial insecurity worst (F1-score = 0.42). Some domains, including financial insecurity, material hardship, and medical cost burden, overlapped substantially and were challenging even for human annotators to differentiate.
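The triage logic described above, scoring embedding similarity and then routing pairs to automatic acceptance, automatic rejection, or human review, can be sketched as follows. This is a minimal illustration, not the poster's actual implementation: the toy vectors, thresholds, and labels are assumptions, and a real pipeline would obtain vectors from an embedding model rather than hand-coding them.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def triage(score, high=0.85, low=0.40):
    """Bin a similarity score for the mapping workflow.

    The thresholds are hypothetical; in practice they would be
    tuned against human-annotated SKOS mappings.
    """
    if score >= high:
        return "high"    # candidate automatic mapping
    if score <= low:
        return "low"     # candidate automatic rejection
    return "review"      # ambiguous: flag for human annotation

# Toy embeddings standing in for two {item} + {response set} pairs.
pair_a = [0.9, 0.1, 0.3]   # e.g., an employment-status item
pair_b = [0.8, 0.2, 0.4]   # a differently instantiated analogue

score = cosine_similarity(pair_a, pair_b)
print(triage(score))
```

Only pairs landing in the middle band require annotator time, which is the source of the labor savings the abstract describes.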
Future work includes further refinement of this workflow by comparing the performance of different embedding algorithms, examining performance on categorical versus continuous variables, determining bins for semantic similarity scores (high, medium, low), and exploring the use of other vocabularies and the annotation of data with multiple domains.
