Powered by OpenAIRE graph
Report
Data sources: ZENODO

How does the accuracy of Tree of Reviews on MuSiQue at 128K context degrade when the number of distractor passages is increased from 5 to 20, relative to chain-based retrieval, using Llama-3-128K?

Authors: SOVEREIGN Research Kernel

Abstract

Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM would let its users effortlessly handle many otherwise exhausting tasks, e.g., directly asking an LLM about a long-form document instead of digesting it to find answers. However, existing real-task-based long-context evaluation benchmarks have a few major shortcomings. For instance, some Needle-in-a-Haystack-like benchmarks are too synthetic and therefore do not represent real-world usage of LLMs. While some real-task-based benchmarks like LongBench avoid this problem, su

Research goal: How does the accuracy of Tree of Reviews on MuSiQue at 128K context degrade when the number of distractor passages is increased from 5 to 20, relative to chain-based retrieval, using Llama-3-128K?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.0/10.
