
Abstract

Winter storms are among Europe’s costliest natural hazards, yet models that predict their damage remain scarce due to a lack of granular damage data. This scarcity has also hindered assessments of whether flexible machine-learning methods and novel, high-resolution hazard and exposure data improve predictive performance. We address this gap by applying three increasingly flexible model classes (generalised linear models, generalised additive models and XGBoost) to a unique insurance dataset containing approximately one million residential building insurance policies in force during nine winter storms in the Netherlands (9.4 million policy-storm observations in total). Using leave-one-storm-out cross-validation and an external extrapolation test, we evaluate out-of-sample residential building damage prediction across nested feature sets comprising high-resolution wind gusts, precipitation, atmospheric thermodynamic state variables (air pressure, air temperature and relative humidity), building characteristics and location effects. Feature selection affects prediction accuracy more than model class, and once informative features are included, models with different levels of flexibility and interpretability perform similarly. We therefore find no clear trade-off between interpretability and predictive performance. The best models achieve near-zero aggregate bias and an average absolute storm-level error below 38%, with errors largely attributable to uncaptured damage-relevant information. Relative to an open-source European wind-damage benchmark, our portfolio-calibrated approach reduces storm-level prediction errors by about 90%, demonstrating the value of tailored calibration over generic damage models. Future efforts to improve these models should prioritise the collection of relevant, high-resolution damage and feature data, especially when predicting damage from extreme winter storms.
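As an illustration of the leave-one-storm-out evaluation named above, the following minimal Python sketch holds out each storm in turn, fits on the remaining storms and compares predicted with observed total damage for the held-out storm. It is a sketch under stated assumptions, not the study's implementation: the toy records, the field names (`storm`, `gust`, `damage`), the helper functions and the simple damage-per-gust ratio model are all hypothetical stand-ins for the insurance data and for the GLM, GAM and XGBoost models, and the signed relative error of storm totals is an assumed definition of the storm-level error.

```python
# Hypothetical toy records: one row per policy-storm observation.
rows = [
    {"storm": "A", "gust": 10.0, "damage": 20.0},
    {"storm": "A", "gust": 20.0, "damage": 40.0},
    {"storm": "B", "gust": 10.0, "damage": 30.0},
    {"storm": "B", "gust": 30.0, "damage": 90.0},
    {"storm": "C", "gust": 20.0, "damage": 50.0},
]


def fit(train):
    """Stand-in model (assumption): average damage per unit of gust speed."""
    return sum(r["damage"] for r in train) / sum(r["gust"] for r in train)


def predict(rate, row):
    """Predicted damage for one policy-storm observation."""
    return rate * row["gust"]


def leave_one_storm_out(rows):
    """Return the signed relative error of total damage per held-out storm.

    Assumes every storm has strictly positive observed total damage.
    """
    errors = {}
    for held_out in sorted({r["storm"] for r in rows}):
        train = [r for r in rows if r["storm"] != held_out]
        test = [r for r in rows if r["storm"] == held_out]
        rate = fit(train)  # the held-out storm is never seen during fitting
        predicted = sum(predict(rate, r) for r in test)
        observed = sum(r["damage"] for r in test)
        errors[held_out] = (predicted - observed) / observed
    return errors


errors = leave_one_storm_out(rows)
# Average absolute storm-level error across all held-out storms.
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)
```

Splitting by storm rather than by individual policy keeps every observation from the held-out event out of the training data, so the resulting error reflects prediction for an unseen storm.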
