Leveraging Network Topology for Credit Risk Assessment in P2P Lending: A Comparative Study under the Lens of Machine Learning

Overview

This deposit provides the materials required to reproduce the empirical workflow, figures, and manuscript source for the study “Leveraging Network Topology for Credit Risk Assessment in P2P Lending: A Comparative Study under the Lens of Machine Learning.” The study forms Chapter 3 of the doctoral dissertation “Risk Management in Digital Finance: Assessment and Pricing in an Emerging Fintech Era” by Lennart John Baals and is published as:

Liu, Y., Baals, L. J., Osterrieder, J., & Hadji-Misheva, B. (2024). Leveraging network topology for credit risk assessment in P2P lending: A comparative study under the lens of machine learning. Expert Systems with Applications, 252, 124100.

This deposit contains (i) the LaTeX source of the thesis chapter, (ii) Jupyter notebooks implementing data preprocessing, network construction, model training/evaluation, and explainability analyses, and (iii) figure outputs summarizing descriptive statistics, feature importance, ROC performance, and network-centrality characteristics.

Contents of this deposit (file-level summary)

- Manuscript / thesis chapter source: main_WP3_PhD_Lennart_Baals.tex, bibliography files (e.g., reference_paper_1.bib), and formatting assets (e.g., apa.bst).
- Jupyter notebooks (analysis pipeline):
  - Preprocessing & feature engineering: 0.1_data_preprocessing.ipynb, 0.2_descriptive_statistics.ipynb
  - Model training and evaluation workflow: 1_2023.01.05 Data_Pre-processing_&_Models_training.ipynb, 2_2023.07.05_Models_analysis.ipynb, 3_2023.06.28 Model Re-training and testing.ipynb, 4_2023.07.05_Models_analysis.ipynb
  - Explainability: 4_2023.06.28 SHAP Explainability.ipynb
  - Additional experimentation / automation: 2023.04.22 SNF P2P Credit Risk Auto ML.ipynb
- Key figures / outputs (PDF):
  - Descriptive statistics of raw variables (e.g., interest rate, loan amount, borrower characteristics, prior-loan measures): descriptive_stats_raw_data_full (...).pdf
  - Network-centrality descriptive statistics: descriptive_stats_pagerank.pdf, descriptive_stats_betweenness.pdf, descriptive_stats_closeness.pdf, descriptive_stats_katz.pdf, descriptive_stats_authority.pdf, descriptive_stats_hub.pdf
  - Model performance summaries: all_model_roc_curves.pdf
  - Feature importance / model diagnostics: rf_feature_importance.pdf, glm_feature_importance.pdf, dl_feature_importance.pdf, plus best-model summaries such as RF (best_result).pdf, GLM (best_result).pdf, DL (best_result).pdf

Methodological summary (what the code produces)

The workflow implements a network-enhanced credit risk assessment framework for P2P lending. In brief, the analysis:

- constructs a borrower/loan similarity graph using origination-time information and derives network representations,
- extracts multiple centrality measures (e.g., PageRank, betweenness, closeness, Katz, hub/authority) as additional predictors that encode structural information about similarity-based borrower position,
- trains and compares several machine-learning models for default prediction (including linear baselines and non-linear learners), and
- evaluates predictive performance using standard classification metrics and ROC curves, complemented by feature-importance and SHAP-based explainability analyses.
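
The first two steps of the methodological summary (similarity graph, then centrality features) can be illustrated with a minimal, standard-library-only sketch. It is not the notebooks' implementation: the distance-threshold linking rule, the toy feature vectors, the loan identifiers, and the function names build_similarity_graph, degree_centrality, and pagerank are all assumptions made for illustration, and only two simple centralities are shown. In practice a graph library such as NetworkX would typically be used for the full set of measures.

```python
import math
from itertools import combinations


def build_similarity_graph(features, threshold):
    """Link two loans when the Euclidean distance between their feature
    vectors is at most `threshold` (a hypothetical linking rule)."""
    graph = {loan_id: set() for loan_id in features}
    for a, b in combinations(features, 2):
        if math.dist(features[a], features[b]) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Share of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def pagerank(graph, damping=0.85, iterations=100):
    """PageRank by power iteration on an undirected graph."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Rank held by isolated nodes is redistributed uniformly.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {}
        for node in graph:
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Toy, made-up origination-time features (already scaled to [0, 1]).
features = {
    "loan_a": (0.10, 0.20),
    "loan_b": (0.15, 0.25),
    "loan_c": (0.12, 0.18),
    "loan_d": (0.90, 0.80),
}

graph = build_similarity_graph(features, threshold=0.1)
degrees = degree_centrality(graph)
ranks = pagerank(graph)

# One row of network features per loan, ready to join onto tabular predictors.
network_features = {
    loan_id: {"degree": degrees[loan_id], "pagerank": ranks[loan_id]}
    for loan_id in graph
}
```

In this toy example loan_a, loan_b, and loan_c form a triangle while loan_d stays isolated, so the three similar loans receive higher centrality values than the outlier. The resulting per-loan dictionary plays the role of the additional predictors described above.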

The included outputs summarize both the distributional characteristics of the raw data and the incremental predictive value of network-topology features across model classes.

Data sources and access conditions

The empirical component relies on loan-level P2P lending data (e.g., platform data such as Bondora and/or comparable sources, depending on the chapter configuration). Redistribution may be restricted by data-provider terms and privacy constraints. This deposit therefore emphasizes code, documentation, and figure outputs. Users intending to fully reproduce all results should obtain the underlying raw data from the original provider(s) under their own access rights and then apply the provided preprocessing and variable mapping steps as documented in the notebooks. Any included data descriptions are intended to facilitate transparent replication while respecting the applicable redistribution constraints.

Reproducibility (how to run)

A typical reproduction path is:

1. Run the preprocessing notebooks (0.1_data_preprocessing.ipynb, 0.2_descriptive_statistics.ipynb) to generate cleaned features and descriptive tables.
2. Execute the training/evaluation notebooks (1_..., 2_..., 3_..., 4_...) to reproduce model estimation, ROC curves, and feature-importance outputs.
3. Run the explainability notebook (4_2023.06.28 SHAP Explainability.ipynb) to reproduce SHAP summaries and interpretability results.
4. Compile the thesis chapter from main_WP3_PhD_Lennart_Baals.tex (using the included bibliography/style assets) if you wish to regenerate the manuscript PDF.

Intended use

This deposit is intended for:

- replication of the published results (subject to data access constraints),
- reuse of the similarity-graph + centrality-feature construction approach for other P2P or retail-credit datasets, and
- benchmarking of network-enhanced models against conventional credit-scoring baselines.
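
For the benchmarking use case, the comparison between a conventional baseline and a network-enhanced model usually comes down to comparing ROC AUC on the same held-out loans. The sketch below is a minimal, standard-library-only illustration of that comparison. The function name roc_auc, the labels, and both score vectors are made up for illustration and are not results from the study; in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen default (label 1) is scored above a randomly chosen non-default
    (label 0); tied scores count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one default and one non-default")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels (1 = default) and predicted default scores.
labels = [1, 0, 1, 0, 0, 1]
baseline_scores = [0.60, 0.40, 0.35, 0.50, 0.20, 0.70]  # made-up baseline output
network_scores = [0.65, 0.30, 0.40, 0.45, 0.20, 0.80]   # made-up network-enhanced output

auc_baseline = roc_auc(labels, baseline_scores)
auc_network = roc_auc(labels, network_scores)
auc_gain = auc_network - auc_baseline
```

The pairwise formulation is quadratic in the number of loans, which is fine for a sketch but too slow for full loan books; a rank-based implementation gives the same value in O(n log n).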

Licensing and reuse

Unless otherwise noted within individual files, the intent is to enable reuse for academic and non-commercial research with appropriate attribution. If different licenses apply to code vs. manuscript text/figures, this should be reflected in the record license choice.

Related Organizations

University of Twente
Netherlands
Bern University of Applied Sciences
Switzerland

Keywords

Machine Learning, Fintech, P2P Lending, Credit Risk

