EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes

Authors: Lintang Sutawika; Hailey Schoelkopf; Leo Gao; Baber Abbasi; Stella Biderman; Jonathan Tow; ben fattori; +23 Authors

Abstract

This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.

New Benchmarks & Tasks

A big wave of new evaluation tasks this release (a usage sketch follows the list):

- AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
- BabiLong and LongBench v2 for long-context evaluation by @jannalulu in #3287, #3338
- GraphWalks by @jannalulu in #3377
- ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
- Icelandic WinoGrande by @jmichaelov in #3277
- CLIcK Korean benchmark by @shing100 in #3173
- MMLU-Redux (generative) and its Spanish translation by @luiscosio in #2705
- EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
- EQBench in Spanish and Catalan by @priverabsc in #3168
- Anthropic discrim-eval by @Helw150 in #3091
- XNLI-VA by @FranValero97 in #3194
- Bangla MMLU (TituLM) by @Ismail-Hossain-1 in #3317
- HumanEval infilling by @its-alpesh in #3299
- CNN-DailyMail 3.0.0 by @preordinary in #3426
- Global PIQA and a new acc_norm_bytes metric by @baberabb in #3368
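To make the new tasks concrete, here is a minimal sketch of running one of them through the harness's Python API. The task name "aime" and the Pythia checkpoint are illustrative assumptions, not confirmed by these notes; run `lm-eval --tasks list` to see the names registered in your install.

```python
# Minimal sketch: evaluate a HuggingFace model on one of the newly added
# tasks. "aime" is an assumed task name; substitute any registered task.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint
    tasks=["aime"],                                  # assumed task name
    batch_size=8,
)
print(results["results"])  # per-task metrics keyed by task name
```

The same call should work for any of the tasks above once the relevant optional dependencies are installed (e.g. the math and longbench extras noted in #3321).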
Fixes & Improvements

Core Changes (the BOS-token, logging, and dtype changes are illustrated in the sketch after this section):

- Python 3.10 minimum by @jannalulu in #3337
- Unpinned the datasets library by @baberabb in #3316
- BOS token handling: delegate to the tokenizer; add_bos_token now defaults to None by @baberabb in #3347
- Renamed the LOGLEVEL env var to LMEVAL_LOG_LEVEL to avoid conflicts by @fxmarty-amd in #3418
- Resolved duplicate task names and added safeguards by @giuliolovisotto in #3394

Task Fixes:

- Fixed MMLU-Redux to exclude samples without error_type="ok" and display a summary table by @fxmarty-amd in #3410, #3406
- Fixed AIME answer extraction by @jannalulu in #3353
- Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
- Fixed the crows_pairs dataset by @jannalulu in #3378
- Fixed Gemma tokenizer add_bos_token not updating by @DarkLight1337 in #3206
- Fixed lambada_multilingual_stablelm by @jmichaelov, @HallerPatrick in #3294, #3222
- Fixed CodeXGLUE by @gsaltintas in #3238
- Pinned the correct MMLUSR version by @christinaexyou in #3350
- Updated minerva_math by @baberabb in #3259

Backend Fixes:

- Fixed vLLM import errors when vLLM is not installed by @fxmarty-amd in #3292
- Fixed a vLLM issue with data_parallel_size>1 by @Dornavineeth in #3303
- Resolved the deprecated vllm.utils.get_open_port by @DarkLight1337 in #3398
- Fixed bugs with GPT-series models by @zinccat in #3348
- Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
- Fixed additional_config parsing by @brian-dellabetta in #3393
- Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
- Fixed no-output error handling by @Oseltamivir in #3395
- Replaced the deprecated torch_dtype with dtype by @AbdulmalikDS in #3415
- Fixed custom task config reading by @SkyR0ver in #3425
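The following sketch illustrates the BOS-token, logging, and dtype changes above, assuming the harness is driven from Python; the Gemma checkpoint and task choice are placeholders.

```python
import os

# The logging variable was renamed from LOGLEVEL to LMEVAL_LOG_LEVEL
# (#3418); set it before importing lm_eval so the logger picks it up.
os.environ["LMEVAL_LOG_LEVEL"] = "DEBUG"

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # add_bos_token now defaults to None, which delegates the decision to
    # the tokenizer (#3347); pass it explicitly only to override. "dtype"
    # replaces the deprecated "torch_dtype" argument (#3415).
    model_args="pretrained=google/gemma-2b,add_bos_token=True,dtype=bfloat16",
    tasks=["lambada_openai"],
)
```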
Model & Backend Support

New backends and model families (a usage sketch follows the list):

- OpenAI GPT-5 support by @babyplutokurt in #3247
- Azure OpenAI support by @zinccat in #3349
- Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
- OpenVINO text2text models by @nikita-savelyevv in #3101
- Intel XPU support for HFLM by @kaixuanliu in #3211
- Attention head steering support by @luciaquirke in #3279
- Leverage vLLM's tokenizer_info endpoint to avoid manual duplication by @m-misiura in #3185
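As a sketch of the expanded backend coverage, the call below targets the chat-completions backend with the GPT-5 model string from #3247. The task choice and the assumption that credentials come from OPENAI_API_KEY are illustrative; verify against your provider's setup (the Azure variant from #3349 is configured analogously).

```python
import lm_eval

# Assumes OPENAI_API_KEY is set in the environment; "gpt-5" follows the
# release notes (#3247), everything else here is an illustrative choice.
results = lm_eval.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-5",
    tasks=["gsm8k"],
    apply_chat_template=True,  # chat endpoints expect chat-formatted prompts
)
```

For local models, the HF backend should now also run on Intel GPUs (e.g. device="xpu", per #3211).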
What's Changed

- Remove trust_remote_code: True from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213
- Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
- Fix add_bos_token not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206
- remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
- Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
- Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
- Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
- Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218
- Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221
- Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219
- Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184
- Fix unknown group key to tag in yaml config for lambada_multilingual_stablelm by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
- update minerva_math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259
- feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173
- Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
- Add support for OpenVINO text2text generation models by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
- Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
- Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
- feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
- pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268
- Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
- Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265
- Add acc_norm metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272
- Add acc_norm metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271
- Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
- Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279
- Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277
- Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047
- [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292
- Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
- add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
- feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705
- Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287
- Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296
- Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
- Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
- [fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321
- Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
- unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316
- bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337
- Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338
- Leverage vllm's tokenizer_info endpoint to avoid manual duplication by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
- Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
- remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343
- Align humaneval_64_instruct task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344
- Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
- [fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353
- add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368
- [fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378
- Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
- fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
- added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349
- Delegate BOS to the tokenizer; add_bos_token defaults to None by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347
- fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361
- [feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377
- Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359
- Resolve deprecation of vllm.utils.get_open_port by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3398
- Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
- Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
- Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
- [MMLU redux] Do not use samples which do not have error_type="ok" by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410
- fix: resolve duplicate task names and add safeguards by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394
- Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311
- [bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
- fix(tasks): pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
- Fix lambada_multilingual_stablelm by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294
- Fix descriptions in the Moral Stories and Histoires Morales tasks by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374
- Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
- [fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406
- Rename the conflicting environment variable LOGLEVEL to LMEVAL_LOG_LEVEL by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418
- Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
- Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
- New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426
New Contributors

- @LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
- @ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
- @Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
- @babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
- @FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
- @HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
- @Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
- @nikita-savelyevv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
- @weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
- @jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
- @slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
- @gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
- @valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
- @TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
- @kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
- @its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
- @priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
- @Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
- @m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
- @Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
- @zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
- @marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
- @neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
- @ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408
- @Oseltamivir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395
- @tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331
- @brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393
- @christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350
- @AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415
- @Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422
- @SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425
- @preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2

Impact indicators (provided by BIP!):
  • Selected citations: 15. Citations derived from selected sources; an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • Popularity: Top 10%. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
  • Influence: Top 10%. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • Impulse: Top 10%. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.