Scaling Effects on Adversarial Robustness in Contrastive vs. MLM Pretraining for Code Generation

SOVEREIGN Research Kernel



Data sources: ZENODO


Publication type: Report. Status: Under curation. Language: English. Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20669603



Abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM [...]

Research goal: How does the scaling of model size affect the adversarial robustness gap between contrastive pretraining and MLM pretraining for code generation, as measured by accuracy on the HumanEvalFix benchmark under increasing perturbation magnitudes?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.7/10.
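
The research goal refers to a "robustness gap" measured by accuracy under increasing perturbation magnitudes, but the record does not define it. One plausible reading, offered here only as a minimal sketch, is the per-magnitude difference between the accuracy of the contrastively pretrained model and that of the MLM-pretrained model. The function name `robustness_gap`, the dictionary layout, and all numeric values below are illustrative assumptions, not results or definitions from the report.

```python
from typing import Dict


def robustness_gap(
    acc_contrastive: Dict[float, float],
    acc_mlm: Dict[float, float],
) -> Dict[float, float]:
    """Return contrastive-minus-MLM accuracy at each shared perturbation magnitude.

    Assumption: both inputs map a perturbation magnitude to an accuracy in [0, 1].
    This definition of the gap is hypothetical; the report may define it differently.
    """
    shared = sorted(set(acc_contrastive) & set(acc_mlm))
    return {eps: acc_contrastive[eps] - acc_mlm[eps] for eps in shared}


# Placeholder accuracies for illustration only (NOT measured values).
contrastive = {0.0: 0.50, 0.1: 0.40, 0.2: 0.30}
mlm = {0.0: 0.50, 0.1: 0.35, 0.2: 0.20}

gaps = robustness_gap(contrastive, mlm)
for eps, gap in gaps.items():
    print(f"magnitude={eps:.1f} gap={gap:+.3f}")
```

Under this reading, repeating the computation for each model size and comparing the resulting gap curves would address the scaling question posed in the research goal.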
