
Yemeni Proverbs: A Benchmark Corpus for Figurative and Cultural Language Modeling

This dataset contains 5,252 Yemeni Arabic proverbs paired with their corresponding explanations in Modern Standard Arabic (MSA). The corpus was compiled from four printed proverb anthologies and three publicly accessible digital repositories between January and June 2024.

The dataset was created through manual transcription of printed materials and structured extraction of digital sources. All entries were manually verified to ensure accurate pairing between proverb text and its original explanation. Duplicate and incomplete records were removed during preprocessing.

Each record includes the following fields:

- id: Unique integer identifier
- proverb: Dialectal Yemeni Arabic proverb (UTF-8 encoded)
- explanation: Explanation in Modern Standard Arabic (transcribed from source)
- source: Title of the printed anthology or name of the digital repository
- city: Geographic origin if explicitly stated in the source (otherwise null)
- url: Direct link to online source when applicable (null for printed sources)

The corpus preserves dialectal orthography and does not introduce new explanatory annotations. All explanations were transcribed directly from the original sources. Geographic metadata is available for approximately 27% of entries. No geographic inference was performed when such information was not explicitly provided in the source materials.

The dataset is intended to support research in:

- Figurative language understanding
- Dialect-aware Arabic NLP
- Culturally grounded language modeling
- Evaluation of generative models on non-MSA input
- Computational folkloristics

This repository contains:

- Yemeni_proverbs.json (primary dataset file, UTF-8 encoded)

The dataset is distributed under the Creative Commons Attribution 4.0 (CC BY 4.0) license. Users are responsible for consulting original publishers for access to full source documents under their respective terms.
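
The record layout described above can be checked with a short Python sketch. It is illustrative only and rests on an assumption the description does not confirm: that Yemeni_proverbs.json holds a top-level JSON array of objects keyed by the six listed field names. The helper names (load_records, missing_fields, city_coverage) are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path

# Field names as listed in the dataset description.
EXPECTED_FIELDS = ("id", "proverb", "explanation", "source", "city", "url")


def load_records(path):
    """Load records, assuming a top-level JSON array of objects (unconfirmed layout)."""
    with Path(path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return records


def missing_fields(record):
    """Return the expected field names that are absent from one record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def city_coverage(records):
    """Return the fraction of records whose city field is present and non-null."""
    if not records:
        return 0.0
    with_city = sum(1 for record in records if record.get("city") is not None)
    return with_city / len(records)
```

If the layout assumption holds, `city_coverage(load_records("Yemeni_proverbs.json"))` should come out near the stated 27%, and `missing_fields` should return an empty list for every record.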
