Powered by OpenAIRE graph
ZENODO
Report
Data sources: ZENODO

Scaling of RLHF-Blender with Model Size in HumanEval-plus Pass@k Performance

Authors: SOVEREIGN Research Kernel;

Abstract

We apply preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and ide...

Research goal: How does the RLHF-Blender approach scale with increasing model size in terms of pass@k performance on HumanEval-plus compared to independent sampling in CodeT5+?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
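
For readers unfamiliar with the pass@k metric named in the research goal, the sketch below shows the unbiased estimator commonly used for code-generation benchmarks: with n samples per problem, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The report does not state which estimator it uses, so this is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (hypothetical helper, not from the report).

    n: total samples generated for the problem
    c: number of those samples that pass all tests
    k: sampling budget being evaluated
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no
    # passing sample is C(n - c, k) / C(n, k); math.comb returns 0 when
    # k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(10, 5), (10, 0)], k=1)`.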
