A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data

Publication: Article, Other literature type, Preprint. 26 Mar 2021. United Kingdom, United States, Netherlands, Korea (Republic of). English.

Publisher: Elsevier BV

Journal: Computer Methods and Programs in Biomedicine, volume 211, article 106394 (ISSN: 0169-2607)

Funded by: EC | EHDEN

Authors: Khalid, Sara; Yang, Cynthia; Blacketer, Clair; Duarte-Salles, Talita; Fernández-Bertolín, Sergio; Kim, Chungsoo; Park, Rae Woong; +7 Authors

doi: 10.1016/j.cmpb.2021.106394 , 10.1101/2021.03.23.21254098

pmid: 34560604

pmc: PMC8420135

Abstract

Background and Objective: As a response to the ongoing COVID-19 pandemic, several prediction models have been rapidly developed, with the aim of providing evidence-based guidance. However, no COVID-19 prediction model in the existing literature has been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).

Methods: We show step-by-step how to implement the pipeline for the question: ‘In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?’ We develop models using six different machine learning methods in a US claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the US.

Results: Our open-source tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized for COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.

Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction can enable rapid progress towards reliable prediction models. The OHDSI tools and pipeline are open source and available to researchers around the world.
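The abstract compares models by discrimination and calibration on internal and external validation data. As a rough illustration of what those two quantities measure, the following minimal sketch computes the area under the ROC curve and an observed-to-expected event ratio in plain Python. It is not the OHDSI implementation (the OHDSI tools are separate open-source software), and every function name and all of the toy labels and predicted risks below are hypothetical, chosen only for the example.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Discrimination: probability that a randomly chosen patient with the
    outcome receives a higher predicted risk than a randomly chosen patient
    without it (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outcome and one non-outcome")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def observed_expected_ratio(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Calibration-in-the-large: observed outcome count divided by the sum of
    predicted risks. Values near 1 suggest the mean predicted risk matches
    the observed outcome rate."""
    expected = sum(y_score)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(y_true) / expected


# Hypothetical toy data: 1 = death within 30 days, 0 = no death.
# "internal" stands in for a held-out part of the development database,
# "external" for a different database the model never saw.
internal_y = [0, 0, 1, 0, 1, 1, 0, 1]
internal_p = [0.1, 0.2, 0.7, 0.3, 0.8, 0.6, 0.4, 0.9]
external_y = [0, 1, 0, 0, 1, 0, 1, 0]
external_p = [0.2, 0.6, 0.5, 0.1, 0.4, 0.3, 0.7, 0.6]

for name, y, p in [("internal", internal_y, internal_p),
                   ("external", external_y, external_p)]:
    print(f"{name}: AUROC={auroc(y, p):.3f}, O/E={observed_expected_ratio(y, p):.3f}")
```

The pairwise AUROC loop is quadratic in the number of patients, which is fine for a toy example; for databases with tens of thousands of hospitalizations a rank-based computation would be the more practical choice.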

Countries

United Kingdom, United States, Netherlands, Korea (Republic of)

Related Organizations

Erasmus University Rotterdam, Netherlands
Yonsei University, Korea (Republic of)
University of California, United States
Erasmus University Medical Center, Netherlands
University of California, Los Angeles, United States

Keywords

Artificial Intelligence and Image Processing, COVID-19, observational health data, Biomedical Engineering, 610, Bioengineering, data harmonization, Article, Machine Learning, risk prediction, OHDSI, Information and Computing Sciences, Humans, Electrical and Electronic Engineering, Data quality control, Pandemics, distributed data network, prediction modeling, SARS-CoV-2, Applied computing, Software Engineering, Computer vision and multimedia computation, 004, Phenotypes, Logistic Models, Networking and Information Technology R&D (NITRD), Generic health relevance, Medical Informatics

Impact by BIP!

- Selected citations: 32. These citations are derived from selected sources. This is an alternative to the "Influence" indicator.
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.