ZENODO
Conference object . 2025
License: CC BY
Data sources: Datacite

Representing questionnaire items and response sets as semantic embeddings allows efficient identification of similar constructs despite differing textual representations

Authors: Krishnamurthy, Madan; Dave, Leena; Slade, Timothy; Marcial, Laura Haak; Montavon, Joel; Tyndall, Benjamin; Ortiz, Jacqueline; +1 Authors

Abstract

The datasets hosted in NHLBI's BioData Catalyst® (BDC) ecosystem include records derived from surveys and clinical questionnaires. Absent common data elements--whether mandated by funders or resulting from research community consensus--it is common for similar constructs (e.g., employment status, financial insecurity, or medical cost burden) to be instantiated with different variable names and/or ranges of valid response values. This representational heterogeneity poses a significant challenge along multiple dimensions of data FAIRness (Findability, Accessibility, Interoperability, and Reusability). In the BDC context, it can inhibit discoverability, data harmonization, and creation of dataset- and study-spanning cohorts. To address these issues, we develop a scalable approach to identifying {questionnaire item} + {response set} pairs that assess similar constructs but are instantiated differently. Leveraging a form of data representation known as an "embedding", we represent the semantics of these pairs in a way that permits computation of similarity scores. We can then automatically categorize highly similar and highly divergent embeddings and flag ambiguously similar pairs for human review. Our approach can drastically reduce the labor required to execute such mapping exercises. This poster presents the results of a proof-of-concept application of this method using variables from the Gravity Project, data dictionaries from four BioData Catalyst datasets, and a marking schema based upon Simple Knowledge Organization System (SKOS) relations. Performance varied by domain, with employment status performing best (F1-score = 1.0) and financial insecurity performing worst (F1-score = 0.42). Some domains, including financial insecurity, material hardship, and medical cost burden, overlapped significantly and were challenging for human annotators to differentiate.
Future work includes further refinement of this workflow by comparing the performance of different embedding algorithms, examining performance on categorical variables versus continuous variables, determining binning for semantic similarity scores (high, medium, low), and exploring the possibility of other vocabularies or annotating data with multiple domains.
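
The triage step described in the abstract (score a pair of embeddings, accept or reject the clear cases automatically, and route the ambiguous ones to a human) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the poster: cosine similarity as the score, the toy two-dimensional vectors standing in for real embeddings, the function names, and the 0.4 and 0.8 cut-offs. The abstract itself lists the choice of similarity bins as future work.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def triage(score, low=0.4, high=0.8):
    """Bin a similarity score; `low` and `high` are hypothetical cut-offs."""
    if score >= high:
        return "similar"
    if score <= low:
        return "divergent"
    return "needs_review"


# Toy vectors standing in for embeddings of {item} + {response set} pairs.
# A real embedding model would produce vectors with many more dimensions.
item_a = [1.0, 0.0]
item_b = [1.0, 0.0]  # same direction as item_a
item_c = [0.0, 1.0]  # orthogonal to item_a
item_d = [1.0, 1.0]  # halfway between item_a and item_c

for name, other in [("b", item_b), ("c", item_c), ("d", item_d)]:
    score = cosine_similarity(item_a, other)
    print(f"a vs {name}: score={score:.2f} -> {triage(score)}")
```

In this sketch only the pairs that land in the middle bin would need an annotator's attention, which is where the labor saving described in the abstract would come from. How wide that middle bin should be is a tuning decision that depends on the embedding model and the domain.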
