Do We Trust the Explanation? Exploring Disagreement in Post-hoc XAI for Defect Prediction

ZENODO

Other literature type . 2025

License: CC BY

Data sources: ZENODO

Publication . Report, Other literature type . 01 Feb 2025 . English . Publisher: Zenodo

doi: 10.5281/zenodo.14783915

Abstract

Machine learning (ML)-based defect prediction models can help practitioners identify bug-prone modules in large software projects. Identifying bug-prone modules improves resource allocation, lowers maintenance costs, raises software quality, and ensures dependable, secure products. However, practitioners might not accept such defect predictors because they lack interpretability. Therefore, post-hoc explanation methods such as LIME, SHAP, and BreakDown have gained popularity. These explanation techniques offer insights into the decision-making of ML models by ranking features in order of importance. However, post-hoc explainability of ML techniques is new to the Software Engineering (SE) community; hence, it is unclear whether such methods help practitioners make better decisions regarding software maintenance. Furthermore, recent user studies show that data scientists often employ multiple post-hoc explainers to understand the decision of a single model because ground-truth datasets are lacking. Because the different techniques each approximate the behavior of the model in their own way, their explanations can disagree, and this disagreement often makes post-hoc methods confusing for practitioners. In this study, we first investigate disagreements among three explainers (LIME, SHAP, and BreakDown) for software defect prediction. Second, we attempt to identify the types of disagreement that occur more frequently than others. Third, we propose an aggregation method that helps to reduce disagreement. Finally, we surveyed 74 practitioners to learn whether they agreed with our findings. Our proposal is to aggregate post-hoc explanations, reducing emphasis on disagreements and highlighting areas of agreement. According to the survey responses, approximately 90% of the participants confirmed the validity of the proposed aggregation method.
Our novel method bridges the gap created by these disagreements, opening doors for software engineers to extract valuable insights from multiple explanations.
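
To make the idea of disagreement between feature rankings concrete, the following is a minimal sketch and not the method used in the report. It assumes hypothetical attribution scores for a single prediction, hypothetical feature names (`loc`, `churn`, `complexity`, `authors`), top-k feature overlap as one possible disagreement measure, and a simple mean-rank aggregation as one possible way to combine rankings. The abstract does not specify which metrics or aggregation scheme the authors use.

```python
from itertools import combinations

# Hypothetical attribution scores for one prediction (illustrative only,
# not data from the study and not produced by the real explainer libraries).
explanations = {
    "LIME": {"loc": 0.42, "churn": 0.31, "complexity": 0.12, "authors": 0.05},
    "SHAP": {"loc": 0.38, "complexity": 0.29, "churn": 0.20, "authors": 0.02},
    "BreakDown": {"churn": 0.35, "loc": 0.33, "authors": 0.10, "complexity": 0.08},
}


def ranking(scores):
    """Return feature names ordered from most to least important by |score|."""
    return sorted(scores, key=lambda f: abs(scores[f]), reverse=True)


def top_k_agreement(rank_a, rank_b, k):
    """Fraction of the top-k features shared by two rankings (order ignored)."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


def aggregate_by_mean_rank(rankings):
    """Order features by their average position across all rankings.

    Assumed aggregation scheme for illustration; every ranking must
    contain the same set of features.
    """
    features = rankings[0]
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings) for f in features
    }
    return sorted(features, key=lambda f: mean_rank[f])


rankings = {name: ranking(scores) for name, scores in explanations.items()}

# Pairwise agreement on the two most important features.
pairwise = {
    (a, b): top_k_agreement(rankings[a], rankings[b], k=2)
    for a, b in combinations(rankings, 2)
}

# A single consensus ranking built from all three explainers.
consensus = aggregate_by_mean_rank(list(rankings.values()))

if __name__ == "__main__":
    for pair, score in pairwise.items():
        print(pair, score)
    print("consensus:", consensus)
```

In this toy input, two of the three explainer pairs share only one of their top two features, while the mean-rank consensus still places `loc` first because all three explainers rank it first or second. Real studies typically report several complementary measures (for example rank or sign agreement), so top-k overlap alone should be read as a simplification.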

Keywords

Software Maintenance, Defect Prediction, SHAP, BreakDown, Software Engineering, LIME, Empirical Research, Explainability
