AgentBench: Evaluating LLMs as Agents

descriptionPublicationkeyboard_double_arrow_right Article , Preprint , Conference object 01 Jan 2023Embargo end date: 01 Jan 2023Journal:CoRR, volume abs/2308.03688

Authors: Xiao Liu 0036; Hao Yu 0030; Hanchen Zhang; Yifan Xu; Xuanyu Lei; Hanyu Lai; Yu Gu 0016; +15 Authors

doi: 10.48550/arxiv.2308.03688

arXiv: 2308.03688

AgentBench: Evaluating LLMs as Agents

- Summary
- Subjects
- Metrics

Abstract

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

Published in ICLR 2024

Related Organizations

View all View all

Keywords

Machine Learning, FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Artificial Intelligence, Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	2
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

2

Average

Green

Related to Research communities

Knowmad Institut