
## What's Changed
* Replace stale triviaqa dataset link by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/364
* Update actions/setup-python in CI workflows by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365
* Bump triviaqa version by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/366
* Update lambada_openai multilingual data source by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370
* Update Pile Test/Val Download URLs by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373
* Added ToxiGen task by @Thartvigsen in https://github.com/EleutherAI/lm-evaluation-harness/pull/377
* Added CrowSPairs by @aflah02 in https://github.com/EleutherAI/lm-evaluation-harness/pull/379
* Add accuracy metric to crows-pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380
* hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/384
* Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in https://github.com/EleutherAI/lm-evaluation-harness/pull/390
* Upstream hf-causal and hf-seq2seq model implementations by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381
* Hosting arithmetic dataset on HuggingFace by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/391
* Hosting wikitext on HuggingFace by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/396
* Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403
* Update README installation instructions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407
* feat: evaluation using peft models with CLM by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/414
* Update setup.py dependencies by @ret2libc in https://github.com/EleutherAI/lm-evaluation-harness/pull/416
* fix: add seq2seq peft by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418
* Add support for load_in_8bit and trust_remote_code model params by @philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422
* Hotfix: patch issues with the huggingface.py model classes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427
* Continuing work on refactor [WIP] by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/425
* Document task name wildcard support in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/435
* Add non-programmatic BIG-bench-hard tasks by @yurodiviy in https://github.com/EleutherAI/lm-evaluation-harness/pull/406
* Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in https://github.com/EleutherAI/lm-evaluation-harness/pull/447
* [WIP, Refactor] Staging more changes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/465
* [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467
* Configurable-Tasks by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/438
* single GPU automatic batching logic by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/394
* Fix bugs introduced in #394 #406 and max length bug by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/472
* Sort task names to keep the same order always by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/474
* Set PAD token to EOS token by @nikhilpinnaparaju in https://github.com/EleutherAI/lm-evaluation-harness/pull/448
* [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486
* fix adaptive batch crash when there are no new requests by @jquesnelle in https://github.com/EleutherAI/lm-evaluation-harness/pull/490
* Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/426
* Create output path directory if necessary by @janEbert in https://github.com/EleutherAI/lm-evaluation-harness/pull/483
* Add results of various models in json and md format by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/477
* Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501
* P3 prompt task by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/493
* Evaluation Against Portion of Benchmark Data by @kenhktsui in https://github.com/EleutherAI/lm-evaluation-harness/pull/480
* Add option to dump prompts and completions to a JSON file by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/492
* Add perplexity task on arbitrary JSON data by @janEbert in https://github.com/EleutherAI/lm-evaluation-harness/pull/481
* Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/520
* Data Parallelism by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/488
* Fix mgpt fewshot by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/522
* Extend dtype command line flag to HFLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/523
* Add support for loading GPTQ models via AutoGPTQ by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/519
* Change type signature of quantized and its default value for python < 3.11 compatibility by @passaglia in https://github.com/EleutherAI/lm-evaluation-harness/pull/532
* Fix LLaMA tokenization issue by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/531
* [Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/542
* Move spaces from context to continuation by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/546
* Use max_length in AutoSeq2SeqLM by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/551
* Fix typo by @kwikiel in https://github.com/EleutherAI/lm-evaluation-harness/pull/557
* Add load_in_4bit and fix peft loading by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/556
* Update task_guide.md by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/564
* [Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/559
* Dataset metric log [WIP] by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/560
* Add Anthropic support by @zphang in https://github.com/EleutherAI/lm-evaluation-harness/pull/562
* Add MultipleChoiceExactTask by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/537
* Revert "Add MultipleChoiceExactTask" by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/568
* [Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/567
* Remove the registration of "GPT2" as a model type by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/574
* [Refactor] Docs update by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577
* Better docs by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/576
* Update evaluator.py cache_db argument str if model is not str by @poedator in https://github.com/EleutherAI/lm-evaluation-harness/pull/575
* Add --max_batch_size and --batch_size auto:N by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/572
* [Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/581
* Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/582
* Fix non-callable attributes in CachingLM by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/584
* Add error handling for calling .to(device) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585
* fixes some minor issues on tasks. by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/580
* Add - 4bit-related args by @SONG-WONHO in https://github.com/EleutherAI/lm-evaluation-harness/pull/579
* Fix triviaqa task by @seopbo in https://github.com/EleutherAI/lm-evaluation-harness/pull/525
* [Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/578
* Logging Samples by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/563
* Merge master into big-refactor by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/590
* [Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/596
* fixes for multiple_choice by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/598
* add openbookqa config by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/600
* [Refactor] Model guide docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/606
* [Refactor] More MCQA fixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/599
* [Refactor] Hellaswag by @nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/608
* [Refactor] Seq2Seq Models with Multi-Device Support by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/565
* [Refactor] CachingLM support via --use_cache by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/619
* [Refactor] batch generation better for hf model ; deprecate hf-causal in new release by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/613
* [Refactor] Update task statuses on tracking list by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/629
* [Refactor] device_map options for hf model type by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/625
* [Refactor] Misc. cleanup of dead code by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/609
* [Refactor] Log request arguments to per-sample json by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/624
* [Refactor] HellaSwag YAML fix by @nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/639
* [Refactor] Add caveats to parallelize=True docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/638
* fixed super_glue and removed unused yaml config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/645
* [Refactor] Fix sample logging by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/646
* Add PEFT, quantization, remote code, LLaMA fix by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/644
* [Refactor] Handle cuda:0 device assignment by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/647
* [refactor] Add prost config by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/640
* [Refactor] Misc. bugfixes ; edgecase quantized models by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/648
* Update init.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/650
* [Refactor] Add Lambada Multilingual by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/658
* [Refactor] Add: SWAG,RACE,Arithmetic,Winogrande,PubmedQA by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/627
* [refactor] Add qa4mre config by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/651
* Update generation_kwargs by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/657
* [Refactor] Move race dataset on HF to EleutherAI group by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/661
* [Refactor] Add Headqa by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/659
* [Refactor] Add Unscramble ; Toxigen ; Hendrycks_Ethics ; MathQA by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/660
* [Refactor] Port TruthfulQA (mc1 only) by @nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/666
* [Refactor] Miscellaneous fixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/676
* [Refactor] Patch to revamp-process by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/678
* Revamp process by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/671
* [Refactor] Fix padding ranks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/679
* [Refactor] minor edits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/680
* [Refactor] Migrate ANLI tasks to yaml by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/682
* edited output_path and added help to args by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/684
* [Refactor] Minor changes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/685
* [Refactor] typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/687
* [Test] fix test_evaluator.py by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/675
* Fix dummy model not invoking super class constructor by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/688
* [Refactor] Migrate webqs task to yaml by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/689
* [Refactor] Fix tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/693
* [Refactor] Migrate xwinograd tasks to yaml by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/695
* Early stop bug of greedy_until (primary_until should be a list of str) by @ZZR0 in https://github.com/EleutherAI/lm-evaluation-harness/pull/700
* Remove condition to check for winograd_schema by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/690
* [Refactor] Use console script by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/703
* [Refactor] Fixes for when using num_fewshot by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/702
* [Refactor] Updated anthropic to new API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/710
* [Refactor] Cleanup for big-refactor by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/686
* Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/720
* [Refactor] Benchmark scripts by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/612
* [Refactor] Fix Max Length arg by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/723
* Add note about MPS by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/728
* Update huggingface.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/730
* Update README.md by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/732
* [Refactor] Port over Autobatching by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/673
* [Refactor] Fix Anthropic Import and other fixes by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/724
* [Refactor] Remove Unused Variable in Make-Table by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/734
* [Refactor] logiqav2 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/711
* [Refactor] Fix task packaging by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/739
* [Refactor] fixed openai by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/736
* [Refactor] added some typehints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/742
* [Refactor] Port Babi task by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/752
* [Refactor] CrowS-Pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/751
* Update README.md by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/745
* [Refactor] add xcopa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/749
* Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/764
* [Refactor] Add Blimp by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/763
* [Refactor] Use evaluation mode for accelerate to prevent OOM by @tju01 in https://github.com/EleutherAI/lm-evaluation-harness/pull/770
* Patch Blimp by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/768
* [Refactor] Speedup hellaswag context building by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/774
* [Refactor] Patch crowspairs higher_is_better by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/766
* [Refactor] XNLI by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/776
* [Refactor] Update Benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/777
* [WIP] Update API docs in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/747
* [Refactor] Real Toxicity Prompts by @aflah02 in https://github.com/EleutherAI/lm-evaluation-harness/pull/725
* [Refactor] XStoryCloze by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/759
* [Refactor] Glue by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/761
* [Refactor] Add triviaqa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/758
* [Refactor] Paws-X by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/779
* [Refactor] MC Taco by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/783
* [Refactor] Truthfulqa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/782
* [Refactor] fix doc_to_target processing by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/786
* [Refactor] Add README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/757
* [Refactor] Don't always require Perspective API key to run by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/788
* [Refactor] Added HF model test by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/791
* [Big refactor] HF test fixup by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/793
* [Refactor] Process Whitespace for greedy_until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/781
* [Refactor] Fix metrics in Greedy Until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/780
* Update README.md by @Wehzie in https://github.com/EleutherAI/lm-evaluation-harness/pull/803
* Merge Fix metrics branch by @uSaiPrashanth in https://github.com/EleutherAI/lm-evaluation-harness/pull/802
* [Refactor] Update docs by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/744
* [Refactor] Superglue T5 Parity by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/769
* Update main.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/817
* [Refactor] Coqa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/820
* [Refactor] drop by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/821
* [Refactor] Asdiv by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/813
* [Refactor] Fix IndexError by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/819
* [Refactor] toxicity: API inside function by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/822
* [Refactor] wsc273 by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/807
* [Refactor] Bump min accelerate version and update documentation by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/812
* Add mypy baseline config by @ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/809
* [Refactor] Fix wikitext task by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/833
* [Refactor] Add WMT tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/775
* [Refactor] consolidated tasks tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/831
* Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/838
* [Refactor] mgsm by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/784
* [Refactor] Add top-level import by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/830
* Add pyproject.toml by @ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/810
* [Refactor] Additions to docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/799
* [Refactor] Fix MGSM by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/845
* [Refactor] float16 MPS works in torch nightly by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/853
* [Refactor] Update benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/850
* Switch to pyproject.toml based project metadata by @ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/854
* Use Dict to make the code python 3.8 compatible by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/857
* [Refactor] NQopen by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/859
* [Refactor] NQ-open by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/798
* Fix "local variable 'docs' referenced before assignment" error in write_out.py by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/856
* [Refactor] 3.8 test compatibility by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/863
* [Refactor] Cleanup dependencies by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/860
* [Refactor] Qasper, MuTual, MGSM (Native CoT) by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/840
* undefined type and output_type when using promptsource fixed by @Hojjat-Mokhtarabadi in https://github.com/EleutherAI/lm-evaluation-harness/pull/842
* [Refactor] Deactivate select GH Actions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/871
* [Refactor] squadv2 by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/785
* [Refactor] Set python3.8 as allowed version by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/862
* Fix positional arguments in HF model generate by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/877
* [Refactor] MATH by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/861
* Create cot_yaml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/870
* [Refactor] Port CSATQA to refactor by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/865
* [Refactor] CMMLU, C-Eval port ; Add fewshot config by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/864
* [Refactor] README.md for Asdiv by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/878
* [Refactor] Hotfixes to big-refactor by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/880
* Change Python Version to 3.8 in .pre-commit-config.yaml and GitHub Actions by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/895
* [Refactor] Fix PubMedQA by @tmabraham in https://github.com/EleutherAI/lm-evaluation-harness/pull/890
* [Refactor] Fix error when calling lm-eval by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/899
* [Refactor] bigbench by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/852
* [Refactor] Fix wildcards by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/900
* Add transformation filters by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/883
* [Refactor] Flan benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/816
* [Refactor] WIP: Add MMLU by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/753
* Added notable contributors to the citation block by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/907
* [Refactor] Improve error logging by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/908
* [Refactor] Add _batch_scheduler in greedy_until by @AndyWolfZwei in https://github.com/EleutherAI/lm-evaluation-harness/pull/912
* add belebele by @ManuelFay in https://github.com/EleutherAI/lm-evaluation-harness/pull/885
* Update README.md by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/917
* [Refactor] Precommit formatting for Belebele by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/926
* [Refactor] change all mentions of greedy_until to generate_until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/927
* [Refactor] Squadv2 updates by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/923
* [Refactor] Verbose by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/910
* [Refactor] Fix Unit Tests by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/905
* Fix generate_until rename by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/929
* [Refactor] Generate_until rename by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/931
* Fix "'tqdm' object is not subscriptable" error in huggingface.py when batch size is auto by @jasonkrone in https://github.com/EleutherAI/lm-evaluation-harness/pull/916
* [Refactor] Fix Default Metric Call by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/935
* Big refactor write out adaption by @MicPie in https://github.com/EleutherAI/lm-evaluation-harness/pull/937
* Update pyproject.toml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/915
* [Refactor] Fix whitespace warning by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/949
* [Refactor] Update documentation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/954
* [Refactor] fix two bugs when ran with qasper_bool and toxigen by @AndyWolfZwei in https://github.com/EleutherAI/lm-evaluation-harness/pull/934
* [Refactor] Describe local dataset usage in docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/956
* [Refactor] Update README, documentation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/955
* [Refactor] Don't load MMLU auxiliary_train set by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/953
* [Refactor] Patch for Generation Until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/957
* [Refactor] Model written eval by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/815
* [Refactor] Bugfix: AttributeError: 'Namespace' object has no attribute 'verbose' by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/966
* [Refactor] Mmlu subgroups and weight avg by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/922
* [Refactor] Remove deprecated gold_alias task YAML option by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/965
* [Refactor] Logging fixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/952
* [Refactor] fixes for alternative MMLU tasks. by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/981
* [Refactor] Alias fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/987
* [Refactor] Minor cleanup on base Task subclasses by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/996
* [Refactor] add squad from master by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/971
* [Refactor] Squad misc by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/999
* [Refactor] Fix CI tests by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/997
* [Refactor] will check if group_name is None by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1001
* [Refactor] Bugfixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1002
* [Refactor] Verbosity rework by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/958
* add description on task/group alias by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/979
* [Refactor] Upstream ggml from big-refactor branch by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/967
* [Refactor] Improve Handling of Stop-Sequences for HF Batched Generation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1009
* [Refactor] Update README by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1020
* [Refactor] Remove examples/ folder by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1018
* [Refactor] vllm support by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1011
* Allow Generation arguments on greedy_until reqs by @uSaiPrashanth in https://github.com/EleutherAI/lm-evaluation-harness/pull/897
* Social iqa by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1030
* [Refactor] BBH fixup by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1029
* Rename bigbench.yml to default.yml by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1032
* [Refactor] Num_fewshot process by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/985
* [Refactor] Use correct HF model type for MBart-like models by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1024
* [Refactor] Urgent fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1033
* [Refactor] Versioning by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1031
* fixes for sampler by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1038
* [Refactor] Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1046
* [refactor] mps requirement by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1037
* [Refactor] Additions to example notebook by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1048
* Miscellaneous documentation updates by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1047
* [Refactor] add notebook for overview by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1025
* Update README.md by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1049
* [Refactor] Openai completions by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1008
* [Refactor] Added support for OpenAI ChatCompletions by @DaveOkpare in https://github.com/EleutherAI/lm-evaluation-harness/pull/839
* [Refactor] Update docs ToC by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1051
* [Refactor] Fix fewshot cot mmlu descriptions by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1060

## New Contributors
* @fattorib made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/373
* @Thartvigsen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/377
* @aflah02 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/379
* @sxjscience made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/390
* @Jeffwan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/403
* @zanussbaum made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/414
* @ret2libc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/416
* @philwee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/422
* @yurodiviy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/406
* @nikhilpinnaparaju made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/447
* @lintangsutawika made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/438
* @juletx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/472
* @janEbert made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/483
* @kenhktsui made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/480
* @passaglia made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/532
* @kwikiel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/557
* @poedator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/575
* @SONG-WONHO made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/579
* @seopbo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/525
* @farzanehnakhaee70 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/563
* @nopperl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/608
* @yeoedward made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/682
* @ZZR0 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/700
* @tju01 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/770
* @Wehzie made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/803
* @uSaiPrashanth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/802
* @ethanhs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/809
* @chrisociepa made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/857
* @Hojjat-Mokhtarabadi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/842
* @AndyWolfZwei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/912
* @ManuelFay made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/885
* @jasonkrone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/916
* @MicPie made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/937
* @DaveOkpare made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/839

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.3.0...v0.4.0
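
Two short usage sketches follow, illustrating how several of the additions above compose. They are not part of the upstream release notes: the model id, task names, and file paths are placeholders, and the keyword arguments reflect one reading of the v0.4.0 API, so check them against the shipped docs before relying on them.

The first sketch drives an evaluation through the top-level Python import (#830), exercising the unified `hf` model type that deprecates `hf-causal` (#613), the `dtype` option (#523), automatic batch sizing (#394, #572), and request caching via `use_cache` (#619):

```python
import lm_eval

# Roughly equivalent to the console script added in #703:
#   lm_eval --model hf \
#       --model_args pretrained=EleutherAI/pythia-160m,dtype=float16 \
#       --tasks lambada_openai --batch_size auto --device cuda:0

lm_eval.tasks.initialize_tasks()  # register the YAML tasks packaged with pip installs (#596)

results = lm_eval.simple_evaluate(
    model="hf",  # unified HF model type; "hf-causal" is deprecated (#613)
    model_args="pretrained=EleutherAI/pythia-160m,dtype=float16",
    tasks=["lambada_openai"],
    batch_size="auto",     # automatic batch-size search (#394, #572)
    max_batch_size=64,     # optional ceiling for the auto search (#572)
    use_cache="lm_cache",  # sqlite request-cache path prefix (#619)
    device="cuda:0",
)
print(results["results"])
```

Quantized and adapter-based checkpoints go through the same `model_args` string, e.g. `pretrained=...,peft=...,load_in_4bit=True,trust_remote_code=True` (#422, #556, #644).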
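The second sketch pairs the per-sample logging work (#563, #624) with dumping prompts and completions to JSON (#492) and evaluation on a slice of a benchmark (#480). The `limit` and `log_samples` arguments and the `"samples"` key in the returned dict are again an assumption based on the v0.4.0 evaluator:

```python
import json

import lm_eval

lm_eval.tasks.initialize_tasks()

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    limit=8,           # evaluate only the first 8 documents per task (#480)
    log_samples=True,  # retain per-document requests and model responses (#563, #624)
)

# Aggregated metrics live under "results"; per-sample records under "samples".
with open("hellaswag_samples.json", "w") as f:
    json.dump(
        {"results": results["results"], "samples": results["samples"]},
        f,
        indent=2,
        default=str,  # fall back to str() for non-JSON-serializable values
    )
```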
