Powered by OpenAIRE graph
ZENODO
Report
Data sources: ZENODO

Impact of Multi-Hop QA Benchmark Choice on RAG Retriever Evaluation via F1 Score Analysis

Authors: SOVEREIGN Research Kernel;

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited: most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotpotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Awar

Research goal: Does the choice of multi-hop QA benchmark (HotpotQA vs. MuSiQue vs. SQuAD) significantly affect the evaluation of RAG retriever strategies, and how can this be measured via F1 score differences across datasets?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.4/10.
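
The abstract does not say how F1 is computed, so the following is only a minimal sketch of one plausible reading: a set-based retrieval F1 between the context IDs a retriever returns and the gold supporting contexts, averaged per dataset so the gap between benchmarks can be inspected. The function names, the input layout, and the toy data are all hypothetical and are not taken from the report.

```python
from statistics import mean


def retrieval_f1(retrieved, gold):
    """Set-based F1 between retrieved context IDs and gold supporting context IDs.

    Assumption: retrieval quality is scored on ID overlap, not on answer tokens.
    """
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    if hits == 0:
        # Covers empty inputs and the no-overlap case.
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_dataset(runs):
    """Map {dataset_name: [(retrieved_ids, gold_ids), ...]} to {dataset_name: mean F1}."""
    return {
        name: mean(retrieval_f1(retrieved, gold) for retrieved, gold in pairs)
        for name, pairs in runs.items()
    }


# Made-up toy data, for illustration only (not results from the report).
toy_runs = {
    "dataset_a": [(["d1", "d2"], ["d1", "d2"]), (["d3", "d9"], ["d3", "d4"])],
    "dataset_b": [(["p1"], ["p1"]), (["p2", "p7"], ["p2"])],
}
scores = mean_f1_by_dataset(toy_runs)
gap = scores["dataset_b"] - scores["dataset_a"]
print(scores, gap)
```

A multi-hop question typically has several gold contexts, so recall drops whenever one hop is missed. Under this reading, that is one reason the same retriever could score differently on a multi-hop benchmark than on a single-context one.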
