ZENODO
Other literature type . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Report . 2025
License: CC BY
Data sources: Datacite

Do We Trust the Explanation? Exploring Disagreement in Post-hoc XAI for Defect Prediction


Abstract

Machine learning (ML)-based defect prediction models can help practitioners identify bug-prone modules in large software projects. Identifying bug-prone modules improves resource allocation, lowers maintenance costs, raises software quality, and helps ensure dependable, secure products. However, practitioners might not accept such defect predictors because they lack interpretability. Therefore, post-hoc explanation methods such as LIME, SHAP, and BreakDown have gained popularity. These explanation techniques offer insight into the decision-making of ML models by ranking features in order of importance. However, post-hoc explainability of ML techniques is novel to the Software Engineering (SE) community; hence, it is unclear whether such methods help practitioners make better decisions regarding software maintenance. Furthermore, recent user studies show that data scientists often employ multiple post-hoc explainers to understand the decision of a single model because ground-truth datasets are lacking. Each technique approximates the behavior of the model in its own way, so their explanations can disagree, and this disagreement often makes post-hoc methods confusing for practitioners. In this study, we first investigate disagreements among three explainers, LIME, SHAP, and BreakDown, for software defect prediction. Second, we attempt to identify which types of disagreement occur more frequently than others. Third, we propose an aggregation method that helps to reduce disagreement. Finally, we surveyed 74 practitioners to learn whether they agreed with our findings. The proposition is to aggregate post-hoc explanations, reducing emphasis on disagreements and highlighting areas of agreement. According to the survey responses, approximately 90% of the participants confirmed the validity of the proposed aggregation method. Our novel method bridges the gap created by these disagreements, opening doors for software engineers to extract valuable insights from multiple explanations.
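
The abstract does not spell out how disagreement is measured or how explanations are aggregated, so the following is only an illustrative sketch and not the authors' method. It assumes each explainer returns a feature ranking, measures pairwise disagreement as top-k feature overlap, and aggregates by mean rank (a Borda-style choice made here for illustration). The feature names and rankings are hypothetical.

```python
from itertools import combinations


def top_k_agreement(rank_a, rank_b, k):
    """Fraction of features shared by the top-k of two rankings (1.0 = full agreement)."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


def mean_rank_aggregate(rankings):
    """Order features by their average position across all rankings.

    Assumes every ranking contains the same set of features.
    Ties are broken alphabetically so the result is deterministic.
    """
    features = rankings[0]
    avg = {f: sum(r.index(f) for r in rankings) / len(rankings) for f in features}
    return sorted(features, key=lambda f: (avg[f], f))


# Hypothetical feature rankings for one predicted-defective module,
# most important feature first. These are made-up values, not study data.
explanations = {
    "LIME": ["loc", "churn", "complexity", "authors", "age"],
    "SHAP": ["churn", "loc", "authors", "complexity", "age"],
    "BreakDown": ["loc", "complexity", "churn", "age", "authors"],
}

# Pairwise top-3 agreement between explainers.
pairwise = {
    (a, b): top_k_agreement(explanations[a], explanations[b], k=3)
    for a, b in combinations(explanations, 2)
}

# A single consensus ranking that emphasizes where the explainers agree.
aggregated = mean_rank_aggregate(list(explanations.values()))

for pair, score in pairwise.items():
    print(pair, round(score, 2))
print(aggregated)
```

In this toy input, LIME and BreakDown share the same top-3 features while SHAP differs from each by one feature; the mean-rank consensus places the features that all three rank highly at the top.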

Keywords

Software Maintenance, Defect Prediction, SHAP, BreakDown, Software Engineering, LIME, Empirical Research, Explainability

  • Impact by BIP!
    selected citations: 0
    popularity: Average
    influence: Average
    impulse: Average
Green