MCMC Framework for Learning Bayesian Decision Trees: MATLAB Implementation and Benchmarking

# Bayesian Decision Trees MATLAB Package

## Description

This MATLAB package provides tools to benchmark and compare **Bayesian Decision Trees (BDT)** and **Random Forests (RF)** for decision-making under uncertainty caused by missing data. The package, designed for researchers and practitioners in machine learning and statistics, supports experiments on benchmark datasets with configurable missing data types and levels. The main entry point, `experiment_planner.m`, orchestrates data loading, missing data simulation, k-fold cross-validation, and performance reporting. The package targets applications in fields such as finance, healthcare, ecology, and agriculture, where robust predictions despite incomplete data are critical.

## Features

- **Experiment Setup**: Run single or group experiments comparing BDT and RF on datasets with missing data.
- **Datasets**: Includes synthetic (XOR3) and real-world datasets (HEART, CREDIT, LIQ).
- **Missing Data Simulation**: Supports NA, MCAR, MAR, and MNAR missingness types at levels from 0 to 50%.
- **Performance Metrics**: Evaluates models using accuracy, F1 score, TPR, TNR, and entropy.
- **Cross-Validation**: Implements k-fold cross-validation for robust model evaluation.
- **Reporting**: Generates JSON reports and statistical summaries of model performance.

## Installation

1. **Prerequisites**:
   - MATLAB with Statistics and Machine Learning Toolbox (for `TreeBagger` in RF).
   - Ensure helper functions (e.g., `settings_of_methods.m`, `rf_cross_validation.m`, `bdt_cross_validation.m`, `simulate_missing_data.m`, `cv_data_folds.m`, `performance_metrics.m`, `save_BDT_report.m`, `save_RF_report.m`, `report_performance_tests.m`) are in the MATLAB path.
   - Data files: `heart_failure_clinical_records_dataset.csv` (UCI Heart Failure), `default_of_credit_card_clients.xls` (UCI Credit Card Default), `data_company.csv` (custom company dataset). Place them in the `data/` directory.
   - JSON file: `prop_ratio.json` with BDT proposal ratios (`R1`, `R2`).
2. **Steps**:
   - Clone or download the repository from Zenodo.
   - Add the repository folder to your MATLAB path:
     ```matlab
     addpath('/path/to/repository');
     ```
   - Verify that the data files and `prop_ratio.json` are in the correct directory, or update the paths in `load_data` (in `experiment_planner.m`).

## Usage

The `experiment_planner.m` function is the primary interface for running experiments in two modes:

- **Single Experiment**: Compares BDT and RF on a single dataset with specified missingness parameters.
- **Group Experiment**: Benchmarks BDT and RF across multiple datasets, missingness types, and levels.

### Single Experiment

1. Open `experiment_planner.m` and configure the `p` struct:
   ```matlab
   p.bench_index = 1;  % 0: XOR3, 1: HEART, 2: CREDIT, 3: LIQ
   p.mis_type = 3;     % 0: NA, 1: MCAR, 2: MAR, 3: MNAR
   p.mis_lev = 0.1;    % 10% missingness
   p.nf = 7;           % 7-fold cross-validation
   p.group = 0;        % Single experiment mode
   ```
2. Run:
   ```matlab
   experiment_planner();
   ```
3. **Output**:
   - JSON reports (`reports/report_BDT_000.json`, `reports/report_RF_000.json`).
   - Statistical test results.

### Group Experiment

1. Set `p.group = 1` in `experiment_planner.m`.
2. Run:
   ```matlab
   experiment_planner();
   ```
3. **Output**:
   - JSON reports for each combination of dataset, missingness type (NA, MCAR, MAR, MNAR), and level (0.1 and 0.25, except NA, which is run at level 0).
   - Statistical comparisons.

### Summarizing Results

Run:

```matlab
summary_of_cross_validation_benchmarking();
```

This generates `summary_of_cross_validation_benchmarking.txt`, comparing BDT and RF performance (accuracy, F1, entropy) with statistical significance (p < 0.05).

## Data

- **XOR3**: Synthetic dataset (1000 samples, 3 features: X1, X2 for XOR, X3 dummy).
- **HEART**: UCI Heart Failure Clinical Records (CSV).
- **CREDIT**: UCI Default of Credit Card Clients (Excel).
- **LIQ**: Custom company dataset (CSV, semicolon-separated).

Update `load_data` in `experiment_planner.m` if data paths differ.
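To make the XOR3 description and the MCAR missingness type more concrete, the following is a minimal sketch of how such data could be generated and masked. It is not the package's own generator or its `simulate_missing_data.m`. The uniform feature range, the sign-based XOR rule, the NaN encoding of missing values, and the variable names (`X`, `y`, `mask`, `X_mis`) are all assumptions made for illustration.

```matlab
% Hypothetical sketch: XOR3-style data with MCAR missingness.
% Not the package's implementation; distributions and encoding are assumed.
rng(42);                              % fix the seed so the sketch is repeatable

n = 1000;                             % sample count from the XOR3 description
X = 2 * rand(n, 3) - 1;               % assumed: three features uniform on [-1, 1]
y = xor(X(:, 1) > 0, X(:, 2) > 0);    % label uses X1 and X2 only; X3 is a dummy

mis_lev = 0.1;                        % plays the role of p.mis_lev (10% missingness)
mask = rand(n, 3) < mis_lev;          % MCAR: each entry is dropped independently of the data
X_mis = X;
X_mis(mask) = NaN;                    % assumed: missing values are encoded as NaN

fprintf('Fraction of missing entries: %.3f\n', mean(isnan(X_mis(:))));
```

Because the mask is drawn independently of both `X` and `y`, the observed missing fraction only approximates `mis_lev`. MAR and MNAR schemes would instead make the drop probability depend on observed or unobserved values, respectively.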
## Configuration

The `p` struct in `experiment_planner.m` controls:

- `p.bench_index`: Dataset (0: XOR3, 1: HEART, 2: CREDIT, 3: LIQ).
- `p.mis_type`: Missingness (0: NA, 1: MCAR, 2: MAR, 3: MNAR).
- `p.mis_lev`: Missingness level (0–0.5).
- `p.nf`: Number of cross-validation folds.
- `p.group`: Mode (0: single, 1: group).
- Method settings (in `settings_of_methods.m`):
  - **BDT**: MCMC parameters (`nb` burn-in, `np` post-burn-in, `Pr` proposal probabilities).
  - **RF**: TreeBagger parameters (`nTrees`, `br` bootstrap ratio, `vr` variable sampling ratio).

## Outputs

- **JSON Reports**: Stored in `reports/` as `report_BDT_XXX.json` and `report_RF_XXX.json`, detailing metrics per fold.
- **Summary File**: Statistical comparisons in `summary_of_cross_validation_benchmarking.txt`.

## Troubleshooting

- **File Not Found**: Check the data file paths and the location of `prop_ratio.json`.
- **Missing Functions**: Ensure all helper functions are in the MATLAB path.
- **TreeBagger Errors**: Verify that the Statistics and Machine Learning Toolbox is installed.
- **Memory Issues**: Large datasets or high `nTrees`/`nb` values can exhaust memory; try reducing these values.

## Extending the Package

- **New Datasets**: Add cases to `load_data` and update `settings_of_methods.m`.
- **New Metrics**: Modify `rf_cross_validation.m` and `bdt_cross_validation.m`, then update `summary_of_cross_validation_benchmarking.m`.

## License

Licensed under the MIT License. See the `LICENSE` file for details.

## Contact

For issues or contributions, open a Zenodo issue or email vitaly.schetinin@gmail.com.

Related Organizations

University of Bedfordshire
United Kingdom

Keywords

Missing data, Data Mining/methods, Bayesian Machine Learning
