Powered by OpenAIRE graph
Report
Data sources: ZENODO

Inference Efficiency of CodeLlama and StarCoder on Self-Invoking HumanEval Pro Versus Original HumanEval Benchmarks

Authors: Assignee Research


Abstract

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then use its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs...

Research goal: How does the inference efficiency (measured in tokens per second) of CodeLlama and StarCoder vary when solving self-invoking code generation tasks on HumanEval Pro compared to their efficiency on the original HumanEval benchmark?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.3/10.
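To make the two ideas in the abstract concrete, the following is a minimal Python sketch. It is an illustration only: the problem pair (`sort_numbers`, `median_of_lists`) is hypothetical and not taken from HumanEval Pro, and `tokens_per_second` is an assumed helper that simply divides the number of generated tokens by wall-clock generation time, which is one common reading of the tokens-per-second metric named in the research goal.

```python
from typing import List


# Hypothetical base problem (NOT from HumanEval Pro): sort a list ascending.
def sort_numbers(values: List[int]) -> List[int]:
    return sorted(values)


# Hypothetical self-invoking problem: the solution must call the base solution.
def median_of_lists(lists: List[List[int]]) -> List[float]:
    """Return the median of each non-empty list, reusing sort_numbers."""
    medians = []
    for values in lists:
        ordered = sort_numbers(values)  # reuse of the base solution
        mid = len(ordered) // 2
        if len(ordered) % 2:
            medians.append(float(ordered[mid]))
        else:
            medians.append((ordered[mid - 1] + ordered[mid]) / 2)
    return medians


# Assumed metric helper: generated tokens divided by elapsed wall-clock seconds.
def tokens_per_second(num_generated_tokens: int, elapsed_seconds: float) -> float:
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_generated_tokens / elapsed_seconds
```

In an actual measurement, the token count and the elapsed time would come from the model's generation call. Both are left as plain inputs here so the sketch does not assume any particular inference library.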
