
This report synthesises findings from 12 peer-reviewed papers addressing the following research question: How does the confidence calibration method applied in this study generalize to other vulnerability classification benchmarks like CWE-1000, and what is the accuracy trade-off when scaling to more fine-grained taxonomy labels? Seven claims were extracted from the source literature, and all seven were independently verified against retrieved documents. An automated multi-reviewer quality assessment produced a score of 9.0/10. This report is a machine-generated literature synthesis and does not constitute original research. Full text and citation are available at Assignee Research.
