PyHDL-Eval: An LLM Evaluation Framework for Hardware Design Using Python-Embedded DSLs

Publication: Article, Conference object. Date: 09 Sep 2024. Publisher: IEEE. Journal: 2024 ACM/IEEE 6th Symposium on Machine Learning for CAD (MLCAD)

Authors: Christopher Batten; Nathaniel Pinckney; Mingjie Liu; Haoxing Ren; Brucek Khailany

DOI: 10.1109/mlcad62225.2024.10740201, 10.1145/3670474.3685948, 10.5281/zenodo.13117553, 10.5281/zenodo.13117552


Abstract

There has been a recent trend towards embedding hardware design and verification frameworks within Python to improve the productivity of hardware engineers. At the same time, there is significant recent work exploring the use of large language models (LLMs) to improve key chip design and verification tasks. All of this prior work has focused on LLMs in the context of traditional hardware description languages. This paper describes PyHDL-Eval, a new framework for evaluating LLMs on specification-to-RTL tasks in the context of Python-embedded DSLs. The framework includes 168 problems developed using an ontological approach to cover 19 categories of RTL design. The framework also includes Verilog reference solutions, Verilog test benches, Python test scripts, and workflow orchestration scripts. We use our framework to conduct a detailed case study comparing five LLMs (CodeGemma 7B, Llama3 8B/70B, GPT4, and GPT4 Turbo) targeting Verilog and five Python-embedded DSLs (PyMTL3, PyRTL, MyHDL, Migen, and Amaranth). Our results demonstrate the promise of in-context learning (ICL) when applied to smaller models (e.g., pass rate for CodeGemma 7B improves from 14.9% to 32.7% on Verilog) and Python-embedded DSLs (e.g., pass rate for Llama3 70B improves from 0.6% to 33.0% on PyMTL3). We find LLMs perform equally well or better when targeting Verilog as compared to Python-embedded DSLs (e.g., pass rate for GPT4 Turbo is 72.3% on Verilog and 30.0-62.2% on the Python-embedded DSLs), even though these DSLs are embedded within a popular general-purpose host language. PyHDL-Eval will serve as a useful framework to drive continued research at the intersection of Python-embedded DSLs and LLMs.
The attached Docker image includes everything required to reproduce all of the results in the paper:

- Source code for the PyHDL-Eval framework (Verilog reference solutions, Verilog test benches, Python test scripts, workflow orchestration scripts)
- Pre-installed binaries for all tools (GCC 13.2.0, Make 4.3, Icarus Verilog simulator 12.0, Verilator Verilog simulator 5.020, Python 3.12.3)
- Pre-installed Python packages for all five Python-embedded DSLs (PyMTL3, PyRTL 0.11.1, MyHDL 0.11.45, Migen 0.9.2, Amaranth 0.4.5)
- RTL modules pre-generated using all five LLMs (CodeGemma 7B, Llama3 8B/70B, GPT4, GPT4 Turbo)

Please refer to the README file for how to load the Docker image, test the framework, run all of the simulations, and then generate the result data tables.
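
The abstract quotes pass rates without defining the metric on this page. The sketch below shows how such a percentage could be computed under the simplest reading: one generated design per problem, counted as a pass when its test bench succeeds. This is an assumption rather than the paper's method, which may use a sampled metric such as pass@k. The function name `pass_rate`, the problem identifiers, and the outcomes are all hypothetical and are not taken from the PyHDL-Eval scripts or results.

```python
from typing import Mapping


def pass_rate(results: Mapping[str, bool]) -> float:
    """Return the percentage of problems whose generated design passed.

    `results` maps a problem identifier to True (test bench passed)
    or False (test bench failed). Hypothetical helper, not PyHDL-Eval code.
    """
    if not results:
        raise ValueError("no results to summarize")
    # True counts as 1 and False as 0, so sum() yields the number of passes.
    return 100.0 * sum(results.values()) / len(results)


# Made-up outcomes for 168 problems: the first 42 pass, the rest fail.
hypothetical = {f"prob{i:03d}": i < 42 for i in range(168)}
print(f"{pass_rate(hypothetical):.1f}%")  # prints "25.0%"
```

With 168 problems, each additional passing problem moves this single-sample rate by about 0.6 percentage points.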

Related Organizations

- Cornell University (United States)
- Nvidia (United States)

Impact by BIP!

- Selected citations: 7. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

