Low-Cost and Comprehensive Non-textual Input Fuzzing with LLM-Synthesized Input Generators

Keywords: Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering

Publication (Article, Conference object, Preprint), 01 Jan 2025. Embargo end date: 01 Jan 2025. Publisher: Zenodo

Authors: Zhang, Kunpeng; Li, Zongjie; Wu, Daoyuan; Wang, Shuai; Xia, Xin

doi: 10.5281/zenodo.14728879, 10.5281/zenodo.14728878, 10.48550/arxiv.2501.19282

arXiv: http://arxiv.org/abs/2501.19282

Abstract

Modern software often accepts inputs with highly complex grammars. Recent advances in large language models (LLMs) have shown that they can synthesize high-quality natural language text and code that conforms to the grammar of a given input format. Nevertheless, LLMs are often unable to generate non-textual outputs such as images, videos, and PDF files, or can do so only at prohibitive cost. This limitation hinders the application of LLMs in grammar-aware fuzzing. We present a novel approach that enables grammar-aware fuzzing over non-textual inputs. We employ LLMs to synthesize, and also mutate, input generators in the form of Python scripts that produce data conforming to the grammar of a given input format. The non-textual data yielded by these input generators is then further mutated by a traditional fuzzer (AFL++) to explore the software input space effectively. Our approach, named G2FUZZ, features a hybrid strategy that combines a holistic search driven by LLMs with a local search driven by industrial-quality fuzzers. It has two key advantages: (1) LLMs are good at synthesizing and mutating input generators and help the search escape local optima, yielding a synergistic effect when combined with mutation-based fuzzers; (2) LLMs are invoked only when really needed, which significantly reduces the cost of LLM usage. We have evaluated G2FUZZ on a variety of input formats, including TIFF images, MP4 audio files, and PDF files. The results show that G2FUZZ outperforms state-of-the-art tools such as AFL++, Fuzztruction, and FormatFuzzer in terms of code coverage and bug finding across most programs tested on three platforms: UNIFUZZ, FuzzBench, and MAGMA.
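To make the notion of an "input generator" concrete, the following is a minimal sketch of the kind of Python script the abstract describes: a program that emits a structurally valid non-textual file, whose output a mutation-based fuzzer could then take as a seed. It is not taken from G2FUZZ. The function name `make_tiff`, the choice of an uncompressed 8-bit grayscale image, and the particular tag set are all assumptions made for illustration, and strict TIFF readers may expect further tags (for example resolution fields).

```python
import struct


def make_tiff(width: int, height: int, fill: int = 0x80) -> bytes:
    """Build a tiny uncompressed 8-bit grayscale TIFF (little-endian).

    Hypothetical illustration only; width and height must fit in 16 bits.
    """
    pixels = bytes([fill]) * (width * height)

    # (tag, type, count, value); type 3 = SHORT, type 4 = LONG.
    # Entries are listed in ascending tag order.
    entries = [
        (256, 3, 1, width),        # ImageWidth
        (257, 3, 1, height),       # ImageLength
        (258, 3, 1, 8),            # BitsPerSample
        (259, 3, 1, 1),            # Compression: none
        (262, 3, 1, 1),            # PhotometricInterpretation: BlackIsZero
        (273, 4, 1, None),         # StripOffsets (filled in below)
        (277, 3, 1, 1),            # SamplesPerPixel
        (278, 3, 1, height),       # RowsPerStrip: one strip holds all rows
        (279, 4, 1, len(pixels)),  # StripByteCounts
    ]

    ifd_offset = 8                              # IFD follows the 8-byte header
    ifd_size = 2 + 12 * len(entries) + 4        # count + entries + next-IFD
    data_offset = ifd_offset + ifd_size         # pixel data follows the IFD

    out = struct.pack("<2sHI", b"II", 42, ifd_offset)  # byte order, magic, IFD
    out += struct.pack("<H", len(entries))
    for tag, typ, count, value in entries:
        if value is None:
            value = data_offset
        if typ == 3:
            # A SHORT value sits in the first 2 bytes of the 4-byte field.
            out += struct.pack("<HHIH2x", tag, typ, count, value)
        else:
            out += struct.pack("<HHII", tag, typ, count, value)
    out += struct.pack("<I", 0)                 # no further IFDs
    out += pixels
    return out


if __name__ == "__main__":
    # Write one seed file; the path is an arbitrary example.
    with open("seed_4x4.tiff", "wb") as fh:
        fh.write(make_tiff(4, 4))
```

In a workflow like the one the abstract outlines, a script of this kind would be varied to exercise different format features (other dimensions, bit depths, or tags), and the resulting files would be placed in the fuzzer's seed directory for byte-level mutation.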

USENIX Security 2025

Impact by BIP!

- citations: 0. An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access route: Green