How does the performance of Llama, Mistral, Qwen, and DeepSeek compare on code generation benchmarks like Huma

SOVEREIGN Research Kernel

Data sources: ZENODO

Publication type: Report. Status: Under curation. Language: English. Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20440774

Abstract

Large Language Models (LLMs) for code, known as Code LLMs, have made remarkable advances across diverse code-related tasks, particularly code generation, in which an LLM produces source code from natural language descriptions. This burgeoning field has attracted significant interest from both academic researchers and industry professionals because of its practical significance in software development, as exemplified by GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, from the perspective of natural language processing (NLP), software engineering (SE), or both, there is [...]

Research goal: How does the performance of Llama, Mistral, Qwen, and DeepSeek compare on code generation benchmarks like HumanEval and MBPP when evaluated using pass@1 and pass@k metrics?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.8/10.
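The record does not say how pass@1 and pass@k are computed. As a minimal sketch, assuming the report follows the unbiased estimator commonly used with HumanEval (draw n samples per problem, count the c that pass the unit tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k)), the calculation could look like the following. The function names and the sample counts are hypothetical illustrations, not results from the report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass.

    Assumes the unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts (n, c), for illustration only.
toy_results = [(10, 3), (10, 0), (10, 10)]
toy_pass_at_1 = mean_pass_at_k(toy_results, k=1)
toy_pass_at_5 = mean_pass_at_k(toy_results, k=5)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, which is why pass@1 is often read as single-attempt accuracy. Larger k rewards models that produce at least one correct solution among several attempts.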
