Report
Data sources: ZENODO

Reproducibility Meta-Analysis of Divergent GPT-4o SWE-bench Performance Driven by Evaluation Protocol Discrepancies

Authors: SOVEREIGN Research Kernel


Abstract

As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs (Llama3, Codestral, and Deepseek R1) using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's perf…

Research goal: Reproducibility meta-analysis: 2 independent publications report divergent GPT-4o performance on SWE-bench with a 76.4 percentage-point spread (range 7.0%–83.4%).

Source papers: "SWE-bench Goes Live!" (2025, 7.0%); "FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driv…" (2025, 83.4%).

Preliminary analysis suggests: The extreme discrepancy likely stems from the 83.4% score reflecting a fine-tuned or agentic variant of GPT-4o evaluated under a permissive, multi-turn feedback loop with access to external tools, whereas the 7.0% figure represents the base model's performance in a strict, zero-shot, single-turn setting without execut…

Systematically evaluate which evaluation protocol factors (model configuration, inference setup, quantization, tokenization, few-shot count, metric interpretation, or data-split selection) best explain the observed spread; identify the highest-confidence explanation supported by each paper's stated methodology; and assess whether the highest-reported score is reproducible under the conditions described by the lowest-reporting paper.

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.0/10.
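
The 76.4 percentage-point spread quoted in the abstract can be checked from the two reported scores. The sketch below is only an arithmetic illustration: the dictionary labels and the helper name `spread_in_points` are assumptions made for this example and are not taken from either source paper.

```python
# Scores as quoted in the abstract (percent). The dictionary keys are
# shorthand labels chosen for this sketch, not official names.
reported_scores = {
    "SWE-bench Goes Live! (2025)": 7.0,
    "FeedbackEval (2025)": 83.4,
}


def spread_in_points(scores):
    """Return (lowest, highest, highest - lowest) in percentage points.

    The difference is rounded to one decimal place, matching the
    precision of the reported scores.
    """
    low = min(scores.values())
    high = max(scores.values())
    return low, high, round(high - low, 1)


low, high, spread = spread_in_points(reported_scores)
print(f"range {low}%-{high}%, spread {spread} percentage points")
```

The difference is expressed in percentage points rather than as a relative change, which is how the abstract states it.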
