Propensity Score Matching (PSM) Python-based code

Some tips

It has been tested and works with datasets in SPSS v25, 28 and 29 ('open script'), with Python v3.10 and 3.11, and with R versions 4.3.0 and 4.4.0 together with the 'reticulate' package, versions 1.39 and 1.40. It tolerates missing values acceptably. However, it is desirable to reduce them as much as possible.

Usage

Adapt the code to your current research:
- Replace C:\PATH_TO_YOUR_DATASET.sav with the path to your dataset
- Replace COVS with your covariates (variable name, not label)
- Choose the ratio (1:1, 1:2...) and the caliper
- Choose bar colours and adjust the limits of the x-axis and y-axis to the desired range
- Replace C:\PATH_TO_YOUR_FOLDER with the path to your output folder

Run the script [RStudio, SPSS (File / Open script)...]

Features

All of the codes perform PS matching and store matched pairs.
* Code 1: Sampling without replacement. Five plots showing SMD and PS distributions.
* Code 2: Sampling with replacement. Five plots.
* Code 3: Sampling without replacement. In addition, a lineplot showing the SMD before and after matching. Colours are assigned according to whether a covariate is included in the PS. This makes it possible to show that PSM can also indirectly reduce the SMD of covariates not explicitly included in the PS model, due to underlying correlations or associations.
* Code 4: Sampling without replacement. Focused on matching validation, it stores 3 statistics:
  - SMD (covariates included in PS): the objective is an absolute post-matching SMD <0.1
  - VR (covariates included in PS): post-matching VR close to 1
  - McFadden's pseudo-R² (post-matching value close to zero, indicating that the covariates included in the PS model no longer explain the variability of the dependent variable, DV)
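The three validation statistics listed for Code 4 can be illustrated with a minimal sketch. It uses only the Python standard library and is not taken from the repository scripts. It assumes the common pooled-standard-deviation form of the SMD, a treated-over-control VR, and a binary group indicator as the DV of the PS model. All function names and data values are hypothetical.

```python
from math import log, sqrt
from statistics import mean, variance


def smd(treated, control):
    """Standardised mean difference, using the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of sample variances (treated over control)."""
    return variance(treated) / variance(control)


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given probabilities p."""
    return sum(log(pi) if yi == 1 else log(1 - pi) for yi, pi in zip(y, p))


def mcfadden_r2(y, p_fitted):
    """McFadden's pseudo-R2: 1 - LL(fitted model) / LL(intercept-only model)."""
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1 - log_likelihood(y, p_fitted) / ll_null


# Hypothetical matched sample (illustrative values only).
age_treated = [61, 64, 58, 70, 66, 59]
age_control = [60, 65, 57, 71, 67, 59]
group = [1, 1, 0, 0]                  # 1 = treated, 0 = control
ps_refit = [0.52, 0.49, 0.51, 0.48]   # PS re-estimated after matching

print(f"SMD = {smd(age_treated, age_control):.3f}")
print(f"VR = {variance_ratio(age_treated, age_control):.3f}")
print(f"pseudo-R2 = {mcfadden_r2(group, ps_refit):.3f}")
```

In this toy sample the absolute SMD is below 0.1 and the pseudo-R² is close to zero, which is the pattern the thresholds above look for after matching.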

This repository provides several variants of a free, Python-based code for performing propensity score (PS) matching. It is an initiative of the Camargo Cohort Study, developed with the aim of sharing the tool and spreading the use of PS matching. The code overcomes compatibility issues with R versions and R packages, and implements (i) logistic regression to compute the PS, (ii) 1:N matching using the K-nearest neighbour (KNN) algorithm with a customisable caliper, (iii) sampling with or without replacement, (iv) visualisations to assess matching quality and (v) statistics to evaluate the balance.

Outputs

* Matched pairs stored as a '.csv' file, allowing a Coxreg to be performed ('SET' in SPSS).
* Diagnostic plots stored in the specified output folder, providing a view of the SMD and PS distributions.
* Statistics for matching validation: SMD, variance ratio (VR), McFadden's pseudo-R² and, in the latest version, L1 multivariate imbalance.

The code has been developed using the documentation of the Matplotlib, Numpy and Seaborn libraries, with support and refinements from OpenAI's ChatGPT. No funding was received for conducting this work and there are no financial or non-financial interests to disclose.
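Step (i) above, estimating the PS with a logistic regression, can be illustrated with a dependency-free sketch. The repository scripts rely on library routines for this step. The version below swaps in a plain batch gradient descent so that it runs with the standard library alone. The function names, learning rate, number of epochs and data are all hypothetical, and covariates are assumed to be standardised beforehand.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-z))


def linear_predictor(w, x):
    """Intercept w[0] plus the weighted sum of the covariates in x."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_propensity(X, treated, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient descent.

    X is a list of covariate rows, treated a list of 0/1 indicators.
    Returns one propensity score per row of X.
    """
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)  # intercept followed by one weight per covariate
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, treated):
            err = sigmoid(linear_predictor(w, xi)) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        # Move against the average gradient of the negative log-likelihood.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return [sigmoid(linear_predictor(w, xi)) for xi in X]


# Hypothetical data: one standardised covariate, eight subjects.
X = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [0.2]]
treated = [0, 0, 1, 0, 1, 0, 1, 1]

ps = fit_propensity(X, treated)
for xi, ti, pi in zip(X, treated, ps):
    print(f"x = {xi[0]:+.1f}  treated = {ti}  PS = {pi:.3f}")
```

Each PS is the fitted probability of belonging to the treated group given the covariates. These are the values on which the matching step then operates.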

Full release

New version (2025-03-03) combining codes 3 and 4. It provides:
* Full PS matching with the KNN algorithm, sampling without replacement, caliper=0.2 and ratio 1:1.
* Matched pairs, SMD, variance ratios and pseudo-R^2, stored in a dedicated folder.
* Lineplot with a different colour for the covariates included in the PS, tick spacing and covariate names.

It captures the critical processes of PSM: PS calculation, matching, validation and a diagnostic plot. As noted above, it runs (i) from SPSS v25 with R 3.3.3 [File / Open / Script; Python 3.10 or later is crucial], and (ii) from RStudio with R 4.3 or later and Python 3.10 or later (SPSS is needed only for the path to your dataset).

The comparison between the two PSM procedures (SPSS via R packages and the Python-based script) has shown that, despite the differences in the matched pairs, the SMDs of the variables included in the PS are similar or identical. As an example, a dotplot obtained using PSM for SPSS v1.0 can be compared with the lineplot obtained using our Python-based script. This agreement in post-matching SMDs supports the reliability of the Python implementation.
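The matching step of the full release (nearest neighbour, sampling without replacement, caliper=0.2, ratio 1:1) can be sketched as follows. This is a simplified greedy search in plain Python rather than the KNN routine used by the scripts. It assumes that the caliper is an absolute difference on the PS scale and that treated units are processed in input order, neither of which is stated in the description above. The function name and the PS values are hypothetical.

```python
def match_without_replacement(ps_treated, ps_control, caliper=0.2):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.

    Returns (treated_index, control_index) pairs. Each control is used
    at most once, and a treated unit whose closest remaining control
    lies outside the caliper stays unmatched.
    """
    available = list(range(len(ps_control)))
    pairs = []
    for t_idx, t_ps in enumerate(ps_treated):
        if not available:
            break
        # Closest control that has not been used yet.
        c_idx = min(available, key=lambda j: abs(ps_control[j] - t_ps))
        if abs(ps_control[c_idx] - t_ps) <= caliper:
            pairs.append((t_idx, c_idx))
            available.remove(c_idx)  # sampling without replacement
    return pairs


# Hypothetical propensity scores (illustrative values only).
ps_treated = [0.62, 0.33, 0.90]
ps_control = [0.30, 0.60, 0.65, 0.10]

print(match_without_replacement(ps_treated, ps_control))
# -> [(0, 1), (1, 0)]
```

The third treated unit (PS 0.90) is left unmatched because its closest remaining control (PS 0.65) is 0.25 away, outside the caliper. Sampling with replacement would correspond to dropping the `available.remove` line, so that a control can serve more than one treated unit.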

Comparison between Python-based code and PSM performed by SPSS (based on R packages)

The code was also tested by comparing its results with those of a PSM in SPSS based on R packages (Propensity Score Matching for SPSS v1.0, by Thoemmes F). We selected certain characteristics (5 covariates to be included in the PS, caliper=0.20, ratio 1:1, sampling without replacement) and applied them to both methods. We observed marked discrepancies in the PS values and in the composition of the matched sample. However, the post-matching balance met the standard thresholds with both methods. The differences are probably due to several factors (PS estimation, optimisation algorithms, caliper application...), reflecting the different behaviour of the Python libraries (matplotlib, sklearn) and the R-based packages (MatchIt, RItools, cem). In our opinion, from a practical standpoint, a method can be considered acceptable if the balance after matching meets the key criterion of an absolute SMD <10% in the covariates. This indicates a good PSM model, regardless of the PS values or the composition of the matched pairs.

Balance Assessment Report

The version dated 2025/03/12 provides a balance assessment report in a .docx file. It summarises the results of applying SMD, VR, pseudo-R^2 and L1 multivariate imbalance (Iacus, King and Porro, 2009) before and after matching. The overall imbalance measure was discarded due to the lack of broad consensus. L1 is a measure of multivariate imbalance bounded by 0 (perfect balance) and 1 (complete separation in the cross-tabulation). It is desirable that L1 is smaller in the matched sample than in the unmatched sample. The four tests examine all the covariates used to estimate the PS as well as the variables defined by the user as additional covariates. When running these statistics, sampling without replacement, ratio 1:1 and caliper=0.20 are applied by default. The file is an example of the output produced by the script. As noted above, the details of your own research must be entered: the path to your dataset, the covariates used for estimation, and the folder where the .docx file is stored. It is recommended to check that the required Python modules are available, especially python-docx.
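The L1 measure described above can be illustrated with a short sketch. It follows the usual definition (half the sum, over the cells of the cross-tabulation, of the absolute difference between the treated and control relative frequencies). It assumes that the covariates have already been coarsened into categories, since the value of L1 depends on that binning. The function name and the data are hypothetical and do not come from the repository script.

```python
from collections import Counter


def l1_imbalance(treated_cells, control_cells):
    """L1 = 0.5 * sum over cells of |treated share - control share|.

    Each element is the cell of one subject in the cross-tabulation of
    the coarsened covariates, for example ("60-69", "F").
    """
    f = Counter(treated_cells)
    g = Counter(control_cells)
    n_t, n_c = len(treated_cells), len(control_cells)
    cells = set(f) | set(g)
    # A Counter returns 0 for cells that are absent from one group.
    return 0.5 * sum(abs(f[c] / n_t - g[c] / n_c) for c in cells)


# Hypothetical cells: (age band, sex) for each subject.
treated = [("60-69", "F"), ("60-69", "M"), ("70-79", "F"), ("70-79", "M")]
controls_unmatched = [("50-59", "F"), ("50-59", "M"), ("60-69", "F"), ("50-59", "M")]
controls_matched = [("60-69", "F"), ("60-69", "M"), ("70-79", "F"), ("70-79", "M")]

print(l1_imbalance(treated, controls_unmatched))  # before matching
print(l1_imbalance(treated, controls_matched))    # after matching
```

In this toy example L1 falls from 0.75 before matching to 0 after matching, which is the direction of change the report looks for.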

Improvements

Codes 1 and 3 have been slightly modified. In addition to performing PS matching and saving matched pairs, this latest version (update 2025-02-16) introduces refinements to the lineplots from the matplotlib library. In particular, the visualisation of SMD values has been improved, including validation thresholds. The limits of the y-axis are set to [-1,+1] and the tick frequency (tick spacing) is 0.1. Both can be changed in the code lines:

    plt.ylim(-1, 1)
    ax.yaxis.set_major_locator(ticker.MultipleLocator(0.1))

Related Organizations

Instituto de Investigación Marqués de Valdecilla
Spain
Servicio Cántabro de Salud
Spain
University of Cantabria
Spain

Keywords

Methods, Matching, Propensity Score, Statistics in Medicine, Python

Related to Research communities

EUNICE