Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriev

SOVEREIGN Research Kernel

Report

Data sources: ZENODO

Publication: Report | Under curation | English | Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20408396

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited: most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotpotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Awar

Research goal: Does the accuracy gain from extending context windows to 128K tokens saturate beyond a certain retrieval step count (e.g., 3 steps) for multi-hop reasoning on HotpotQA, and how does this trade-off vary across model scales (7B vs 70B parameters)?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.8/10.
