Predicting Spontaneous Pneumothorax Recurrence with Machine Learning: A Synthetic Example

Aim: Recurrence after primary spontaneous pneumothorax (PSP) remains clinically relevant and may influence the intensity of follow-up and the choice of interventions. Reported recurrence rates vary widely across cohorts. Machine learning (ML) can complement conventional risk stratification by combining multiple predictors into an individualized probability estimate.

Methodology: We generated a synthetic dataset of 1,000 patients with a 12-month recurrence prevalence of 50% to demonstrate an end-to-end supervised ML workflow. Predictors were constructed to mimic common clinical and imaging-derived variables (age, sex, smoking exposure, bleb size, emphysema score, prior pneumothorax, treatment strategy, and a muscle-mass proxy). We compared penalized logistic regression with a random forest classifier, using a stratified train/test split. Model performance was assessed by discrimination (ROC-AUC), overall accuracy (Brier score), calibration intercept/slope, and decision curve analysis (DCA) for clinical utility.

Results: On the held-out test set, logistic regression achieved ROC-AUC 0.7633 and Brier score 0.1989; the random forest achieved ROC-AUC 0.7501 and Brier score 0.2055. Calibration intercept/slope were -0.0910/1.1853 for logistic regression and -0.0438/1.2649 for the random forest. Both models showed positive net benefit at decision thresholds of 0.30 and 0.50.

Conclusion: This synthetic example illustrates key practical steps (data preparation, model training, evaluation, and reporting) and common pitfalls (data leakage, overfitting, and miscalibration). For real-world deployment, transparent reporting and external validation are essential.
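The evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names, the toy outcome and probability vectors, and the calibration convention (intercept and slope fitted jointly by regressing the outcome on the logit of the predicted risk) are assumptions made for illustration, since the abstract does not specify them.

```python
import math

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def net_benefit(y_true, p_pred, threshold):
    """Net benefit of intervening when predicted risk >= threshold (DCA)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in DCA: intervene on every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

def calibration_intercept_slope(y_true, p_pred, n_iter=25):
    """Fit y ~ a + b * logit(p) by Newton-Raphson (assumed convention).

    Requires predicted risks strictly between 0 and 1.
    """
    x = [math.log(p / (1 - p)) for p in p_pred]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical toy data (NOT the study's synthetic cohort).
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print("Brier score:", brier_score(y_toy, p_toy))
for t in (0.30, 0.50):
    print("threshold", t,
          "model NB:", net_benefit(y_toy, p_toy, t),
          "treat-all NB:", net_benefit_treat_all(y_toy, t))
print("calibration (intercept, slope):",
      calibration_intercept_slope(y_toy, p_toy))
```

In a decision curve, a model is usually judged against both the treat-all strategy and the treat-none strategy (net benefit 0), so a positive net benefit alone is a weaker statement than outperforming treat-all at the same threshold. A calibration slope above 1, as reported for both models, would suggest predicted risks that are too close to the mean.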

Keywords

Machine learning, prediction model, pneumothorax
