Performance Comparison of Potential-Based and State-Based Reward Functions on MMLU Benchmark

SOVEREIGN Research Kernel

Data sources: ZENODO

Publication: Report | Under curation | English | Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20663910

Abstract

Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ran...

Research goal: How does the performance of potential-based reward functions compare to state-based reward functions on the MMLU benchmark when applied to models ranging from 7B to 70B parameters under fixed computational budgets?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.6/10.
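The two fine-grained metrics named in the abstract (response length conditioned on correctness, and performance across token budgets) could be computed roughly as follows. This is a minimal sketch, not the report's evaluation code: the sample records, the function names, and the choice to count an over-budget response as incorrect are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical evaluation records: (generated token count, answered correctly?).
# These values are invented for illustration and are not results from the report.
samples = [(120, True), (480, False), (200, True), (950, False), (310, True)]


def length_by_correctness(records):
    """Mean response length, split by whether the answer was correct."""
    correct = [n for n, ok in records if ok]
    incorrect = [n for n, ok in records if not ok]
    return {
        "correct": mean(correct) if correct else None,
        "incorrect": mean(incorrect) if incorrect else None,
    }


def accuracy_under_budget(records, budget):
    """Fraction of all samples that are correct and fit within the token budget.

    Assumption: a response longer than the budget counts as a failure.
    """
    if not records:
        return 0.0
    hits = sum(1 for n, ok in records if ok and n <= budget)
    return hits / len(records)


def budget_curve(records, budgets):
    """Accuracy at each token budget, for plotting performance against budget."""
    return {b: accuracy_under_budget(records, b) for b in budgets}
```

Calling `budget_curve(samples, [128, 256, 512, 1024])` gives one accuracy value per budget, which is the kind of curve the abstract suggests reporting alongside a single headline accuracy.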