GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

Nuno Fachada; Daniel Fernandes; Carlos M. Fernandes; Bruno D. Ferreira-Saraiva; João P. Matos-Carvalho

Future Internet. Article, 2025, peer-reviewed. License: CC BY. Data sources: Crossref.

arXiv.org e-Print Archive. Preprint, 2025. Data sources: arXiv.org e-Print Archive.

https://dx.doi.org/10.48550/ar... Article, 2025. License: CC BY. Data sources: Datacite.

DBLP. Article. Data sources: DBLP.

ReCiL - Repositório Científico Lusófona. Article, 2025. Data sources: ReCiL - Repositório Científico Lusófona.

Type: Article, Preprint. Date: 08 Sep 2025. Embargo end date: 01 Jan 2025. Country: Portugal. Language: English.

Publisher: MDPI AG. Journal: Future Internet, volume 17, page 412 (eISSN: 1999-5903).

doi: 10.3390/fi17090412, 10.48550/arxiv.2508.00033

arXiv: 2508.00033

handle: 10437/15547

Abstract

Large language models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the ParShift library, and synthetic data generation and clustering using pyclugen and scikit-learn. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generate correct, executable code. GPT-4.1 achieved a 100% success rate across all runs in both experimental tasks, whereas most other models succeeded in fewer than half of the runs, with only Grok-3 and Mistral-Large approaching comparable performance. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.
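
The quantitative side of the evaluation described above (functional correctness over multiple runs) can be pictured with a minimal sketch. This is not the authors' harness: the names `run_snippet` and `success_rate`, the timeout value, and the pass criterion (process exit code 0) are assumptions made for illustration only. The sketch runs each generated script in a separate Python process and reports the fraction that finish without error.

```python
import subprocess
import sys


def run_snippet(code: str, timeout: float = 30.0) -> tuple[bool, str]:
    """Run one generated script in a fresh interpreter.

    Returns (succeeded, stderr_text). "Succeeded" means exit code 0,
    which is an assumed criterion, not the paper's definition.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stderr


def success_rate(snippets: list[str]) -> float:
    """Fraction of snippets that run to completion without error."""
    if not snippets:
        return 0.0
    passed = sum(1 for code in snippets if run_snippet(code)[0])
    return passed / len(snippets)
```

The captured standard error text is the kind of material a qualitative error analysis could start from, for example a `ModuleNotFoundError` or an `AttributeError` raised by a misused API call. A fuller check of prompt compliance would also need to inspect the outputs a script produces, which this sketch does not attempt.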

Country

Portugal

Related Organizations

University of Lisbon
Portugal
Universidade Lusófona
Portugal

Keywords

FOS: Computer and information sciences, 68T50, Software Engineering, COMPUTER SCIENCE, PYTHON, INFORMÁTICA, Software Engineering (cs.SE), Artificial Intelligence (cs.AI), I.2.2; I.2.7; D.2.3, CODE GENERATION, Artificial Intelligence, Computation and Language, CRIAÇÃO DE CÓDIGO, Computation and Language (cs.CL)

Related research

Supplementary material for "GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries" (2025, IsSupplementedBy)

Impact by BIP!

Selected citations: 0
Popularity: Average
Influence: Average
Impulse: Average

Access routes: Green, gold

Related to Research communities

Digital Humanities and Cultural Heritage

Knowmad Institut
