BioCoder: a benchmark for bioinformatics code generation with large language models

Publication type: Article, Other literature type, Preprint

Published: 28 Jun 2024

Embargo end date: 01 Jan 2023

Language: English

Publisher: Oxford University Press (OUP)

Journal: Bioinformatics, volume 40, pages i266-i276 (ISSN: 1367-4803, eISSN: 1367-4811)

Authors: Xiangru Tang; Bill Qian; Rick Gao; Jiakang Chen; Xinyun Chen; Mark B. Gerstein

doi: 10.1093/bioinformatics/btae230 , 10.48550/arxiv.2308.16458

pmid: 38940140

pmc: PMC11211839

arXiv: 2308.16458

Abstract

Summary

Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).

Availability and implementation

All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
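The abstract reports results as Pass@K without defining it here. As a minimal sketch, the metric is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n completions are sampled per problem and c of them pass the tests. That BioCoder uses exactly this estimator is an assumption, not something the abstract states; the function names and the sample counts below are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed all tests
    k: budget of attempts being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no passing completion, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@K over problems given (n, c) pairs, one per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts, not data from the paper.
    toy_results = [(10, 5, ), (10, 0), (10, 10)]
    print(mean_pass_at_k(toy_results, k=1))
```

With k = 1 the estimator reduces to the fraction of passing samples, c / n, so the toy run above averages 0.5, 0.0, and 1.0.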

Related Organizations

Yale University, United States
DeepMind (United Kingdom), United Kingdom

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computational Biology, Machine Learning (cs.LG), Artificial Intelligence (cs.AI), General Computational Biology, Programming Languages, Computation and Language (cs.CL), Software, Algorithms

Related research products (2)

biocode software on GitHub (IsRelatedTo)
CellProfiler software on GitHub (IsRelatedTo)

Impact by BIP!

Selected citations: 13. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.