
Turbulence is a benchmark for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) on code generation. It consists of a set of natural-language question templates, each representing a programming problem and parameterised to produce many variations. The variations of a template form a "neighbourhood" of closely related programming questions, enabling assessment of an LLM's ability to generalise across semantically similar but non-equivalent tasks. Each template is paired with a test oracle that automatically checks the correctness of the code the LLM generates. The benchmark surfaces robustness issues by detecting cases where an LLM solves some variations in a neighbourhood but fails on others, giving a detailed and systematic view of model performance. In this release, five LLMs were evaluated with Turbulence: GPT-4, GPT-3.5-turbo, Command, CodeLlama:7B:4-bit-quantised, and CodeLlama:13B:4-bit-quantised. The models were assessed on their ability to generate correct and robust solutions across a large set of neighbourhoods of programming questions.
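
To make the workflow above concrete, the following is a minimal, hypothetical Python sketch of how a parameterised question template, its neighbourhood of variants, and a per-variant test oracle could fit together. All names here (`QuestionTemplate`, `score_neighbourhood`, `generate_code`) are illustrative assumptions and do not correspond to Turbulence's actual code.

```python
# Illustrative sketch only: the class and function names below are hypothetical
# and simply mirror the workflow described above
# (template -> neighbourhood of variants -> oracle check per variant).

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QuestionTemplate:
    """A natural-language programming question with substitutable parameters."""
    prompt_template: str                            # e.g. "Write a function returning the {n}-th smallest element of a list."
    parameter_sets: List[Dict[str, int]]            # each dict instantiates one variant in the neighbourhood
    oracle: Callable[[str, Dict[str, int]], bool]   # checks generated code against expected behaviour

    def neighbourhood(self) -> List[str]:
        """Instantiate the template once per parameter set."""
        return [self.prompt_template.format(**params) for params in self.parameter_sets]


def score_neighbourhood(template: QuestionTemplate,
                        generate_code: Callable[[str], str]) -> List[bool]:
    """Query the model on every variant and record which ones pass the oracle.

    A mix of passes and failures within one neighbourhood signals a
    robustness issue: the model solves some instantiations of the task
    but fails on semantically similar ones.
    """
    results = []
    for prompt, params in zip(template.neighbourhood(), template.parameter_sets):
        code = generate_code(prompt)   # call out to the LLM under test
        results.append(template.oracle(code, params))
    return results
```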
