
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM

Research goal: What is the impact of retrieval store size relative to pretraining corpus size on the performance of multimodal models on the LATEX benchmarks under a fixed data budget?

Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.5/10.
