Impact of Retrieval Store Size Relative to Pretraining Corpus on Multimodal LATEX Benchmark Performance Under Fixed Data Budget

Assignee Research

Data sources: ZENODO

Publication type: Report; Under curation; Language: English; Publisher: Zenodo

Authors: Assignee Research

doi: 10.5281/zenodo.20674700

Abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM

Research goal: What is the impact of retrieval store size relative to pretraining corpus size on the performance of multimodal models on the LATEX benchmarks under a fixed data budget?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.5/10.
