
Machine learning (ML)-based defect prediction models can help practitioners identify bug-prone modules in large software projects. Identifying such modules improves resource allocation, lowers maintenance costs, raises software quality, and helps deliver dependable, secure products. However, practitioners may not accept such defect predictors because they lack interpretability. Therefore, post-hoc explanation methods such as LIME, SHAP, and BreakDown have gained popularity. These explanation techniques offer insight into the decision-making of ML models by ranking features in order of importance. However, post-hoc explainability of ML models is new to the software engineering (SE) community; hence, it is unclear whether such methods help practitioners make better decisions regarding software maintenance. Furthermore, recent user studies show that data scientists often employ multiple post-hoc explainers to understand the decision of a single model because ground-truth datasets are lacking. Because each technique approximates the model's behavior in its own way, the resulting explanations can disagree, and this disagreement often makes post-hoc methods confusing for practitioners to use. In this study, we first investigate disagreements among three explainers (LIME, SHAP, and BreakDown) for software defect prediction. Second, we attempt to identify the types of disagreement that occur more frequently than others. Third, we propose an aggregation method that helps to reduce disagreement. Finally, we survey 74 practitioners to learn whether they agree with our findings. The proposed method aggregates post-hoc explanations, de-emphasizing disagreements and highlighting areas of agreement. According to the survey responses, approximately 90% of the participants confirmed the validity of the proposed aggregation method. 
Our method helps reconcile disagreements among explainers, allowing software engineers to extract valuable insights from multiple explanations.
Software Maintenance, Defect Prediction, SHAP, BreakDown, Software Engineering, LIME, Empirical Research, Explainability
