
This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.

## New Benchmarks & Tasks

A big wave of new evaluation tasks this release:

* AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
* BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
* GraphWalks by @jannalulu in #3377
* ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
* Icelandic WinoGrande by @jmichaelov in #3277
* CLIcK Korean benchmark by @shing100 in #3173
* MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
* EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
* EQBench in Spanish and Catalan by @priverabsc in #3168
* Anthropic discrim-eval by @Helw150 in #3091
* XNLI-VA by @FranValero97 in #3194
* Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
* HumanEval infilling by @its-alpesh in #3299
* CNN-DailyMail 3.0.0 by @preordinary in #3426
* Global PIQA and new acc_norm_bytes metric by @baberabb in #3368
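These tasks plug into the usual evaluation entry points. Below is a minimal sketch of running the new math reasoning benchmarks through the Python API; the task identifiers (`aime`, `math500`) and the checkpoint are illustrative assumptions, so confirm the exact registered names with `lm_eval --tasks list`.

```python
# Minimal sketch: run the new math reasoning tasks via the Python API.
# "aime" and "math500" are assumed task names; verify with `lm_eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # illustrative checkpoint
    tasks=["aime", "math500"],
    batch_size=8,
)
print(results["results"])
```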
## Fixes & Improvements

**Core Changes:**

* Python 3.10 minimum by @jannalulu in #3337
* Unpinned datasets library by @baberabb in #3316
* BOS token handling: Delegate to tokenizer; add_bos_token now defaults to None by @baberabb in #3347
* Renamed LOGLEVEL env var to LMEVAL_LOG_LEVEL to avoid conflicts by @fxmarty-amd in #3418
* Resolve duplicate task names with safeguards by @giuliolovisotto in #3394

**Task Fixes:**

* Fixed MMLU-Redux to exclude samples without error_type="ok" and display summary table by @fxmarty-amd in #3410, #3406
* Fixed AIME answer extraction by @jannalulu in #3353
* Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
* Fixed crows_pairs dataset by @jannalulu in #3378
* Fixed Gemma tokenizer add_bos_token not updating by @DarkLight1337 in #3206
* Fixed lambada_multilingual_stablelm by @jmichaelov, @HallerPatrick in #3294, #3222
* Fixed CodeXGLUE by @gsaltintas in #3238
* Pinned correct MMLUSR version by @christinaexyou in #3350
* Updated minerva_math by @baberabb in #3259

**Backend Fixes:**

* Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
* Fixed vLLM data_parallel_size>1 issue by @Dornavineeth in #3303
* Resolved deprecated vllm.utils.get_open_port by @DarkLight1337 in #3398
* Fixed GPT series model bugs by @zinccat in #3348
* Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
* Fixed additional_config parsing by @brian-dellabetta in #3393
* Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
* Fixed no-output error handling by @Oseltamivir in #3395
* Replaced deprecated torch_dtype with dtype by @AbdulmalikDS in #3415
* Fixed custom task config reading by @SkyR0ver in #3425
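Two of the core changes above touch run configuration directly: the log-level environment variable was renamed, and `add_bos_token` now defaults to `None` so the tokenizer decides. A minimal sketch of overriding both, assuming the variable is read when `lm_eval` sets up logging and that `HFLM` still exposes an `add_bos_token` argument:

```python
# Sketch of the renamed log-level variable and the new BOS default.
# LMEVAL_LOG_LEVEL replaces the old LOGLEVEL name; add_bos_token now defaults to
# None (the tokenizer decides), so pass True/False only to override that behavior.
import os

os.environ["LMEVAL_LOG_LEVEL"] = "DEBUG"  # set before importing lm_eval so logging picks it up

import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(
    pretrained="google/gemma-2-2b",  # illustrative checkpoint, not taken from these notes
    add_bos_token=True,              # force a BOS token instead of deferring to the tokenizer
)
results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag"])
```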
## Model & Backend Support

* OpenAI GPT-5 support by @babyplutokurt in #3247
* Azure OpenAI support by @zinccat in #3349
* Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
* OpenVINO text2text models by @nikita-savelyevv in #3101
* Intel XPU support for HFLM by @kaixuanliu in #3211
* Attention head steering support by @luciaquirke in #3279
* Leverage vLLM's tokenizer_info endpoint to avoid manual duplication by @m-misiura in #3185
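For the new Intel XPU path, existing HF-backend scripts should mostly carry over. A rough sketch, assuming `HFLM` accepts `"xpu"` as an ordinary device string (the device value and checkpoint are illustrative, not confirmed by these notes):

```python
# Rough sketch: point the HF backend at an Intel XPU. The "xpu" device string
# and the checkpoint are assumptions; everything else is the standard HFLM path.
import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(
    pretrained="EleutherAI/pythia-160m",  # illustrative model
    device="xpu",                         # Intel XPU instead of "cuda" / "cpu"
)
results = lm_eval.simple_evaluate(model=lm, tasks=["lambada_openai"])
print(results["results"])
```

The Azure OpenAI and GPT-5 backends are selected the same way as the existing API-based models; check the model registry for their exact `--model` names.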
## What's Changed

* Remove trust_remote_code: True from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213
* Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
* Fix add_bos_token not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206
* remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
* Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
* Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
* Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
* Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218
* Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221
* Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219
* Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184
* Fix unknown group key to tag in yaml config for lambada_multilingual_stablelm by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
* update minerva_math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259
* feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173
* Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
* Add support for OpenVINO text2text generation models by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
* Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
* Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
* feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
* pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268
* Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
* Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265
* Add acc_norm metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272
* Add acc_norm metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271
* Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
* Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279
* Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277
* Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047
* [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292
* Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
* add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
* feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705
* Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287
* Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296
* Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
* Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
* [fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321
* Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
* unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316
* bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337
* Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338
* Leverage vllm's tokenizer_info endpoint to avoid manual duplication by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
* Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
* remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343
* Align humaneval_64_instruct task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344
* Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
* [fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353
* add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368
* [fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378
* Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
* fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
* added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349
* Delegate BOS to the tokenizer; add_bos_token defaults to None by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347
* fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361
* [feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377
* Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359
* Resolve deprecation of vllm.utils.get_open_port by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3398
* Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
* Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
* Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
* [MMLU redux] Do not use samples which do not have error_type="ok" by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410
* fix: resolve duplicate task names and add safeguards by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394
* Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311
* [bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
* fix(tasks): pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
* Fix lambada_multilingual_stablelm by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294
* Fix descriptions in the Moral Stories and Histoires Morales tasks by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374
* Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
* [fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406
* Rename the conflicting environment variable LOGLEVEL to LMEVAL_LOG_LEVEL by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418
* Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
* Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
* New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426
## New Contributors

* @LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
* @ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
* @Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
* @babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
* @FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
* @HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
* @Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
* @nikita-savelyevv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
* @weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
* @jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
* @slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
* @gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
* @valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
* @TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
* @kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
* @its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
* @priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
* @Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
* @m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
* @Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
* @zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
* @marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
* @neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
* @ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
* @Oseltamivir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
* @tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
* @brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
* @christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
* @AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
* @Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
* @SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
* @preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2
