Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators


Liang Chen; Yang Deng; Yatao Bian; Zeyu Qin; Bingzhe Wu; Tat-Seng Chua; Kam-Fai Wong


Institutional Knowledge (InK) at Singapore Management University

Article . 2023

License: CC BY NC ND

Full-Text: https://ink.library.smu.edu.sg/sis_research/9117

Data sources: Bielefeld Academic Search Engine (BASE)

arXiv.org e-Print Archive

Preprint . 2023

Data sources: arXiv.org e-Print Archive

https://doi.org/10.18653/v1/20...

Article . 2023 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2023

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

DBLP

Article

Data sources: DBLP

DBLP

Conference object

Data sources: DBLP


Publication: Article, Preprint, Conference object . 01 Jan 2023
Embargo end date: 01 Jan 2023
Country: Singapore
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing


doi: 10.18653/v1/2023.emnlp-main.390 , 10.48550/arxiv.2310.07289

arXiv: 2310.07289


Abstract

Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks when prompted to generate world knowledge. However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives: Factuality, Relevance, Coherence, Informativeness, Helpfulness and Validity. We conduct an extensive empirical analysis of the generated knowledge from three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that the factuality of generated knowledge, even if lower, does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs are more important than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.

Accepted to EMNLP 2023 main conference

Country

Singapore

Related Organizations

- Hong Kong Polytechnic University, China (People's Republic of)
- Hong Kong University of Science and Technology, China (People's Republic of)
- National University of Singapore, Singapore
- The Chinese University of Hong Kong, Hong Kong
- Chinese University of Hong Kong, China (People's Republic of)


Keywords

FOS: Computer and information sciences, Knowledge intensive tasks, Databases and Information Systems, Computer Science - Computation and Language, Informativeness, Evaluation framework, Information Security, Empirical analysis, Down-stream, Language model, Retrieval techniques, World knowledge, Computation and Language (cs.CL), Comprehensive evaluation, Knowledge evaluations

Related research products (4)

- llama software on GitHub (IsRelatedTo)
- ColBERT-X software on GitHub (IsRelatedTo)
- dl4marco-bert software on GitHub (IsRelatedTo)
- CONNER software on GitHub (IsRelatedTo)

Impact by BIP!

- selected citations: 14. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

