RDFS-LLM-Bench: A Benchmark for Evaluating RDF Schema Inference in LLMs

This dataset accompanies the benchmark RDFS-LLM-Bench, which systematically evaluates how well large language models (LLMs) can perform RDFS inference. The benchmark covers six RDFS entailment rules (rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, rdfs11), 19 entailment patterns (six 1-rule, seven 2-rule, and six 3-rule patterns), seven dataset variants, and evaluation conditions defined by combinations of presented rule types (NRP/ARP) and rule formats (full/name/def).

Contents:
- lod-samples.zip: Raw SPARQL query results from DBpedia, Wikidata, and schema.org
- datasets.zip: Benchmark datasets (seven variants: RK, LS, GS, GSC, NS, NSC, RVA)
- tasks.zip: Zero-shot prompt task files for each evaluation condition
- requests.zip: LLM request files (OpenAI Batch / sequential format)
- responses.zip: Raw LLM response files
- eval.zip: Per-entry evaluation results (strict and flex modes)
- reports.zip: Aggregated scores and composite metrics (CSV / Excel)
- reasoning-trace-samples.zip: Reasoning trace samples collected separately from gpt-oss-120b and gpt-oss-20b. Because the main experimental pipeline did not capture reasoning content, a dedicated sampling pass was conducted specifically to record reasoning traces. Eight (prompting condition × dataset variant) conditions are covered per model: NRP/ARP × full on RK/GS/NS, plus NRP/ARP × name on NS.

Data Sources:
- DBpedia (https://dbpedia.org): CC BY-SA 3.0
- Wikidata (https://www.wikidata.org): CC0 1.0
- schema.org (https://schema.org): CC BY-SA 3.0

Data was collected via SPARQL queries against the public endpoints of the above sources.

Maintenance and Sustainability:

Active Maintenance:
- Maintained by the authors at Aoyama Gakuin University.
- GitHub issues are reviewed on a best-effort basis, typically within 30 days.
- New versions are released on Zenodo (following semantic versioning) for added datasets, models, or rules.

Long-term Accessibility:
- This Zenodo deposit is permanently archived under Zenodo's long-term preservation policy.
- The benchmark remains accessible regardless of GitHub repository status.
- Source code is MIT-licensed; community forks and mirrors are welcome.
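
For readers unfamiliar with the six RDFS entailment rules named in the overview (rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, rdfs11), the following is a minimal sketch of how they derive new triples, written as a naive forward-chaining loop in Python. It is not the benchmark's own code. Triples are modeled as plain string tuples, the `ex:` names are hypothetical, and the example chain (rdfs7 followed by rdfs2) is only meant to show what a multi-rule inference looks like; it is not claimed to match any specific pattern in the datasets.

```python
# Illustrative sketch only; not taken from the RDFS-LLM-Bench code base.
# Triples are (subject, predicate, object) tuples of prefixed-name strings.
RDF_TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBPROP = "rdfs:subPropertyOf"
SUBCLASS = "rdfs:subClassOf"


def apply_rules_once(graph):
    """Return the triples newly entailed by one pass of the six rules."""
    inferred = set()
    for (s, p, o) in graph:
        for (s2, p2, o2) in graph:
            # rdfs2: (a domain c), (x a y) => (x type c)
            if p == DOMAIN and p2 == s:
                inferred.add((s2, RDF_TYPE, o))
            # rdfs3: (a range c), (x a y) => (y type c)
            if p == RANGE and p2 == s:
                inferred.add((o2, RDF_TYPE, o))
            # rdfs5: (a subPropertyOf b), (b subPropertyOf c) => (a subPropertyOf c)
            if p == SUBPROP and p2 == SUBPROP and o == s2:
                inferred.add((s, SUBPROP, o2))
            # rdfs7: (a subPropertyOf b), (x a y) => (x b y)
            if p == SUBPROP and p2 == s:
                inferred.add((s2, o, o2))
            # rdfs9: (c subClassOf d), (x type c) => (x type d)
            if p == SUBCLASS and p2 == RDF_TYPE and o2 == s:
                inferred.add((s2, RDF_TYPE, o))
            # rdfs11: (c subClassOf d), (d subClassOf e) => (c subClassOf e)
            if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                inferred.add((s, SUBCLASS, o2))
    return inferred - graph


def rdfs_closure(graph):
    """Apply the rules repeatedly until no new triple is produced."""
    closure = set(graph)
    while True:
        new = apply_rules_once(closure)
        if not new:
            return closure
        closure |= new


# Hypothetical example graph (the ex: names are made up for illustration).
graph = {
    ("ex:hasPet", DOMAIN, "ex:Person"),
    ("ex:hasDog", SUBPROP, "ex:hasPet"),
    ("ex:alice", "ex:hasDog", "ex:rex"),
}
closure = rdfs_closure(graph)

# rdfs7 derives (ex:alice ex:hasPet ex:rex); rdfs2 then derives the type triple.
print(("ex:alice", RDF_TYPE, "ex:Person") in closure)
```

The quadratic double loop is chosen for readability over speed; for real graphs, an RDF library with an RDFS reasoner would be the more practical choice.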
