Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

Publication: Article, Preprint, Conference object
Date: 31 Mar 2025
Embargo end date: 01 Jan 2023
Publisher: IEEE
Journal: 2025 IEEE Conference on Software Testing, Verification and Validation (ICST)

Authors: Shahin Honarvar; Mark van der Wilk; Alastair F. Donaldson

doi: 10.1109/icst62969.2025.10989005, 10.48550/arxiv.2312.14856

arXiv: 2312.14856

Abstract

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
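The relationship between a question template, its test oracle, a neighbourhood and an anomaly can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the template text, the `stub_model` stand-in for an LLM (deliberately wrong for one parameter value), the oracle's test inputs and the 0.9 "almost all" threshold are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, Iterable, List

Solution = Callable[[List[int]], int]

# Hypothetical question template: one programming problem with a parameter slot.
TEMPLATE = ("Write a function that takes a list of integers and returns "
            "the sum of its first {n} elements.")


def instantiate(n: int) -> str:
    """Produce one concrete question from the template."""
    return TEMPLATE.format(n=n)


def oracle(candidate: Solution, n: int) -> bool:
    """Test oracle: judge a candidate solution for parameter n on fixed inputs."""
    inputs = [[], [5], [1, 2, 3], list(range(-10, 10)), [7] * 50]
    try:
        return all(candidate(list(xs)) == sum(xs[:n]) for xs in inputs)
    except Exception:
        return False  # a crashing solution counts as incorrect


def stub_model(prompt: str, n: int) -> Solution:
    """Stand-in for an LLM (assumption): off by one for n == 7, correct otherwise."""
    if n == 7:
        return lambda xs: sum(xs[:n + 1])
    return lambda xs: sum(xs[:n])


def run_neighbourhood(params: Iterable[int]) -> Dict[int, bool]:
    """Ask every question in the neighbourhood and record the oracle's verdict."""
    return {n: oracle(stub_model(instantiate(n), n), n) for n in params}


def classify(results: Dict[int, bool], almost_all: float = 0.9) -> str:
    """Label a neighbourhood; the 0.9 threshold is an illustrative assumption."""
    rate = sum(results.values()) / len(results)
    if rate == 1.0:
        return "solved"
    if rate == 0.0:
        return "unsolved"
    return "anomaly" if rate >= almost_all else "partially solved"


results = run_neighbourhood(range(1, 21))
failures = [n for n, ok in results.items() if not ok]
print(classify(results), failures)
```

Here the stub solves 19 of the 20 questions in the neighbourhood and fails only for `n == 7`, which is the kind of isolated failure the abstract calls an anomaly. In a real evaluation the stub would be replaced by code returned from an LLM and executed in a sandbox.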

Updated to the ICST2025 conference version

Related Organizations

Department of Computer Science, University of Oxford (United Kingdom)
Imperial College London (United Kingdom)
Department of Computer Science (Spain)
University of Oxford

Keywords

Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence

Related research

copilot software on GitHub (relation: IsRelatedTo)

Impact by BIP!

selected citations: 1 (citations derived from selected sources; an alternative to the "Influence" indicator)
popularity: Average (the "current" impact or attention of an article in the research community at large, based on the underlying citation network)
influence: Average (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)