Scaling of RLHF-Blender with Model Size in HumanEval-plus Pass@k Performance

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and ide[...]

Research goal: How does the RLHF-Blender approach scale with increasing model size in terms of pass@k performance on HumanEval-plus compared to independent sampling in CodeT5+?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
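
For readers unfamiliar with the pass@k metric named in the research goal, it is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below is a minimal illustration under that assumption. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical, and the report does not state which estimator it uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumes the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every k-subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is a (n, c) pair (hypothetical layout)."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score would then be `mean_pass_at_k(per_problem_counts, k)`, computed separately for each model size and each sampling strategy being compared.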
