Reproducibility Meta-Analysis of Divergent GPT-4o SWE-bench Performance Driven by Evaluation Protocol Discrepancies

SOVEREIGN Research Kernel

Publication type: Report; Status: Under curation; Language: English; Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20636331

Abstract

As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs (Llama3, Codestral, and Deepseek R1) using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's perf…

Research goal: Reproducibility meta-analysis. Two independent publications report divergent GPT-4o performance on SWE-bench, with a 76.4 percentage-point spread (range 7.0%–83.4%).

Source papers: "SWE-bench Goes Live!" (2025, 7.0%); "FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driv…" (2025, 83.4%).

Preliminary analysis suggests that the extreme discrepancy likely stems from the 83.4% score reflecting a fine-tuned or agentic variant of GPT-4o evaluated under a permissive, multi-turn feedback loop with access to external tools, whereas the 7.0% figure represents the base model's performance in a strict, zero-shot, single-turn setting without execut…

Systematically evaluate which evaluation protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; identify the highest-confidence explanation supported by each paper's stated methodology; and assess whether the highest-reported score is reproducible under the conditions described by the lowest-reporting paper.

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.0/10.
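The spread quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses just the two scores stated above, and the names `reported_scores`, `spread_pp`, and `score_ratio` are hypothetical helpers, not part of either source paper or of the report's tooling.

```python
# Reported GPT-4o SWE-bench scores in percent, as quoted in the abstract.
# Keys are the (partly truncated) source titles; nothing else is assumed.
reported_scores = {
    "SWE-bench Goes Live!": 7.0,
    "FeedbackEval": 83.4,
}


def spread_pp(scores):
    """Return the max-minus-min spread in percentage points (1 decimal)."""
    values = list(scores.values())
    return round(max(values) - min(values), 1)


def score_ratio(scores):
    """Return how many times larger the highest score is than the lowest."""
    values = list(scores.values())
    return max(values) / min(values)


if __name__ == "__main__":
    print(f"Spread: {spread_pp(reported_scores)} percentage points")
    print(f"Highest/lowest ratio: {score_ratio(reported_scores):.1f}x")
```

Reporting both the absolute spread and the ratio is a design choice for this sketch: the spread matches the figure in the abstract, while the ratio gives a rough sense of how far apart the two evaluation protocols may be.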
