
The datasets hosted in NHLBI's BioData Catalyst® (BDC) ecosystem include records derived from surveys and clinical questionnaires. Absent common data elements, whether mandated by funders or established by research-community consensus, similar constructs (e.g., employment status, financial insecurity, or medical cost burden) are often instantiated with different variable names and/or ranges of valid response values. This representational heterogeneity poses a significant challenge along multiple dimensions of data FAIRness (Findability, Accessibility, Interoperability, and Reusability). In the BDC context, it can inhibit discoverability, data harmonization, and the creation of dataset- and study-spanning cohorts. To address these issues, we developed a scalable approach to identifying {questionnaire item} + {response set} pairs that assess similar constructs but are instantiated differently. Leveraging a form of data representation known as an "embedding," we represent the semantics of these pairs in a way that permits computation of similarity scores. We can then automatically categorize highly similar and highly divergent pairs and flag ambiguously similar pairs for human review. This approach can drastically reduce the labor required to execute such mapping exercises. This poster presents the results of a proof-of-concept application of the method using variables from the Gravity Project, data dictionaries from four BDC datasets, and a mapping schema based upon Simple Knowledge Organization System (SKOS) relations. Performance varied by domain, with employment status scoring best (F1-score = 1.0) and financial insecurity worst (F1-score = 0.42). Some domains, including financial insecurity, material hardship, and medical cost burden, overlapped substantially and were challenging even for human annotators to differentiate.
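The triage logic described above, scoring embedding similarity and then routing pairs to automatic acceptance, automatic rejection, or human review, can be sketched as follows. This is a minimal illustration, not the poster's actual implementation: the toy vectors, thresholds, and labels are assumptions, and a real pipeline would obtain vectors from an embedding model rather than hand-coding them.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def triage(score, high=0.85, low=0.40):
    """Bin a similarity score for the mapping workflow.

    The thresholds are hypothetical; in practice they would be
    tuned against human-annotated SKOS mappings.
    """
    if score >= high:
        return "high"    # candidate automatic mapping
    if score <= low:
        return "low"     # candidate automatic rejection
    return "review"      # ambiguous: flag for human annotation

# Toy embeddings standing in for two {item} + {response set} pairs.
pair_a = [0.9, 0.1, 0.3]   # e.g., an employment-status item
pair_b = [0.8, 0.2, 0.4]   # a differently instantiated analogue

score = cosine_similarity(pair_a, pair_b)
print(triage(score))
```

Only pairs landing in the middle band require annotator time, which is the source of the labor savings the abstract describes.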
Future work includes further refinement of this workflow by comparing the performance of different embedding algorithms, examining performance on categorical versus continuous variables, determining bins for semantic similarity scores (high, medium, low), and exploring the use of other vocabularies and the annotation of data with multiple domains.
