MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use


Authors: Yue Huang; Jiawen Shi; Yuan Li; Chenrui Fan; Siyuan Wu; Qihui Zhang; Yixin Liu; Pan Zhou; Yao Wan; Neil Zhenqiang Gong; Lichao Sun

- arXiv.org e-Print Archive: Preprint, 2023. Data sources: arXiv.org e-Print Archive
- https://dx.doi.org/10.48550/ar...: Article, 2023. License: CC BY NC SA. Data sources: Datacite
- DBLP: Conference object. Data sources: DBLP
- DBLP: Article. Data sources: DBLP

Publication type: Article, Preprint, Conference object
Date: 01 Jan 2023
Embargo end date: 01 Jan 2023
Publisher: arXiv
Journal: CoRR, volume abs/2310.03128

doi: 10.48550/arxiv.2310.03128

arXiv: 2310.03128

Abstract

Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs. These studies primarily investigate how LLMs collaborate effectively with specific, given tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, including both single-tool and multi-tool scenarios. Subsequently, we set the tasks for both tool usage awareness and tool selection. We define four subtasks from different perspectives in tool selection: tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to select tools effectively, highlighting the existing gaps between LLMs and genuine intelligent agents. However, our error analysis shows that there is still significant room for improvement. Finally, we conclude with insights for tool developers: we strongly recommend that tool developers choose an appropriate rewrite model for generating new descriptions based on the downstream LLM the tool will apply to. Our code is available at https://github.com/HowieHwong/MetaTool.
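
To make the two task families named in the abstract more concrete, the following is a minimal sketch of how tool usage awareness (a yes/no decision) and tool selection (picking one tool from a candidate list) might be scored. Everything in it is a hypothetical placeholder: the `Sample` dataclass, the example queries and tool names, and the keyword-based `mock_*` functions standing in for an LLM. It is not the ToolE data or the MetaTool evaluation code, which live in the repository linked in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """One illustrative benchmark item (hypothetical schema, not ToolE's)."""
    query: str
    candidates: list[str]        # hypothetical tool names offered to the model
    gold_tool: Optional[str]     # None means the query needs no tool


def awareness_accuracy(samples: list[Sample],
                       needs_tool: Callable[[str], bool]) -> float:
    """Fraction of queries where the use-a-tool decision matches the label."""
    correct = sum(needs_tool(s.query) == (s.gold_tool is not None)
                  for s in samples)
    return correct / len(samples)


def selection_accuracy(samples: list[Sample],
                       select_tool: Callable[[str, list[str]], Optional[str]]
                       ) -> float:
    """Fraction of tool-requiring queries where the chosen tool is correct."""
    scored = [s for s in samples if s.gold_tool is not None]
    correct = sum(select_tool(s.query, s.candidates) == s.gold_tool
                  for s in scored)
    return correct / len(scored)


# Assumption: a trivial keyword matcher stands in for a real LLM call.
def mock_needs_tool(query: str) -> bool:
    return any(word in query.lower() for word in ("weather", "flight"))


def mock_select_tool(query: str, candidates: list[str]) -> Optional[str]:
    # Return the first candidate whose name prefix appears in the query.
    for name in candidates:
        if name.split("_")[0] in query.lower():
            return name
    return None


TOOLS = ["weather_tool", "flight_tool"]  # hypothetical tool names
SAMPLES = [
    Sample("What is the weather in Paris tomorrow?", TOOLS, "weather_tool"),
    Sample("Book a flight to Tokyo.", TOOLS, "flight_tool"),
    Sample("Write a haiku about autumn.", TOOLS, None),
    # Two tools look plausible here, loosely mirroring the
    # "similar choices" subtask described in the abstract.
    Sample("Will my flight be delayed by the weather?", TOOLS, "flight_tool"),
]

if __name__ == "__main__":
    print("awareness accuracy:", awareness_accuracy(SAMPLES, mock_needs_tool))
    print("selection accuracy:", selection_accuracy(SAMPLES, mock_select_tool))
```

In this toy setup the keyword matcher gets every awareness decision right but misses the ambiguous last query during selection, which illustrates why the two abilities are worth measuring separately.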

Keywords

Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Software Engineering, Computer Science - Computation and Language, Computation and Language (cs.CL)

Related research products (4):

- ChatGLM2-6B software on GitHub (IsRelatedTo)
- MetaGPT software on GitHub (IsRelatedTo)
- babyagi software on GitHub (IsRelatedTo)
- Map_tools software on GitHub (IsRelatedTo)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
