
HumMusQA: A Human-Written Music Understanding QA Benchmark Dataset

HumMusQA is a benchmark dataset for evaluating music understanding in Large Audio-Language Models (LALMs). It contains 320 human-written multiple-choice questions created and validated by musically trained experts to test perception and interpretation of musical content.

This dataset accompanies the paper:

Benno Weck, Pablo Puentes, Andrea Poltronieri, Satyajeet Prabhu, and Dmitry Bogdanov. 2026. HumMusQA: A Human-written Music Understanding QA Benchmark Dataset. In Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026), pages 58–67, Rabat, Morocco. Association for Computational Linguistics.

Files

HumMusQA.csv
Main dataset containing all questions.
Columns: Song link, start time, end time, Question, True answer, Distractor 1, Distractor 2, Distractor 3, Main Category, Secondary Categories, Difficulty

metadata.csv
Track metadata and licensing information.
Columns: track_id, song_link, name, artist_name, album_name, license_ccurl

audio_excerpts.zip
Trimmed audio excerpts corresponding to each question.

audio_full.zip
Full audio tracks.

Licensing

Each track follows its respective Creative Commons license, specified in metadata.csv. Users must comply with the license associated with each track.

Citation

If you use this dataset, please cite:

    @inproceedings{weck-etal-2026-hummusqa,
        title = "{H}um{M}us{QA}: A Human-written Music Understanding {QA} Benchmark Dataset",
        author = "Weck, Benno and Puentes, Pablo and Poltronieri, Andrea and Prabhu, Satyajeet and Bogdanov, Dmitry",
        editor = "Epure, Elena V. and Oramas, Sergio and Doh, SeungHeon and Ramoneda, Pedro and Kruspe, Anna and Sordo, Mohamed",
        booktitle = "Proceedings of the 4th Workshop on {NLP} for Music and Audio ({NLP}4{M}us{A} 2026)",
        month = mar,
        year = "2026",
        address = "Rabat, Morocco",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2026.nlp4musa-1.9/",
        doi = "10.18653/v1/2026.nlp4musa-1.9",
        pages = "58--67",
        ISBN = "979-8-89176-369-2",
        abstract = "The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts."
    }
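
Example: building a multiple-choice prompt from HumMusQA.csv

The column layout of HumMusQA.csv listed under Files suggests a simple way to turn each row into a four-option question. The sketch below is illustrative only and is not the authors' evaluation code. It assumes the CSV header uses exactly the column names listed above; the sample row, its URL, and its category and difficulty values are invented for the demonstration, and the helper names `build_item` and `load_items` are hypothetical. To run it on the real data, one would pass an open handle to HumMusQA.csv instead of the in-memory sample.

```python
import csv
import io
import random

# Invented sample row that mimics the assumed HumMusQA.csv header.
# None of these values come from the real dataset.
SAMPLE_CSV = (
    "Song link,start time,end time,Question,True answer,"
    "Distractor 1,Distractor 2,Distractor 3,"
    "Main Category,Secondary Categories,Difficulty\n"
    "https://example.org/track/1,0,30,Which instrument plays the melody?,"
    "Violin,Flute,Trumpet,Piano,Instrumentation,,Easy\n"
)


def build_item(row, rng):
    """Shuffle the true answer among the distractors.

    Returns the prompt text and the letter of the correct option.
    """
    options = [row["True answer"]] + [row[f"Distractor {i}"] for i in (1, 2, 3)]
    rng.shuffle(options)  # avoid always placing the correct answer first
    letters = "ABCD"
    correct = letters[options.index(row["True answer"])]
    lines = [row["Question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    return "\n".join(lines), correct


def load_items(fileobj, seed=0):
    """Read every CSV row and convert it into a (prompt, correct_letter) pair."""
    rng = random.Random(seed)  # fixed seed keeps the option order reproducible
    return [build_item(row, rng) for row in csv.DictReader(fileobj)]


items = load_items(io.StringIO(SAMPLE_CSV))
prompt, answer = items[0]
print(prompt)
print("Correct option:", answer)
```

Shuffling the options with a seeded generator keeps the answer position from leaking the label while still making runs repeatable. The audio excerpt for each question would be paired with the prompt separately, using the files in audio_excerpts.zip.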
