Performance variation of Llama-3.1-8B on Ruler benchmark across base and instruction-tuned checkpoints in synthetic context

Data sources: ZENODO

Publication type: Report; Status: Under curation; Language: English; Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20636488

Abstract

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well

Research goal: How does the Ruler benchmark performance of Llama-3.1-8B vary between base checkpoints and instruction-tuned variants when evaluated on synthetic context retrieval tasks?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.3/10.
