
# HDCR: Cross-lingual Medical Misinformation Detection Dataset

## Dataset Overview

This dataset contains 72,275 cross-lingual claim-evidence pairs for detecting fine-grained medical misinformation in English and Chinese health communications. Each sample pairs a health claim with peer-reviewed biomedical evidence and provides a fine-grained distortion label.

## Dataset Description

### Data Format

The dataset is provided in three JSON files: `train.json` (43,363 samples, 60%), `dev.json` (7,229 samples, 10%), and `test.json` (21,683 samples, 30%). Each file contains a list of samples with the following structure:

```json
{
  "id": "0039959",
  "claim": "Workplace health promotion program delivers significant benefits",
  "document": "What Can You Achieve in 8 Years? A Case Study on Workplace Health Promotion Program...",
  "class_label": 3,
  "language": "en"
}
```

### Field Descriptions

- `id`: Unique identifier for each sample
- `claim`: Health claim from medical news sources (English or Chinese)
- `document`: Corresponding scientific evidence (title and abstract from a peer-reviewed publication)
- `class_label`: Distortion category (0-4, see below)
- `language`: Language of the claim (`"en"` for English, `"cn"` for Chinese)

### Label Definitions

- **Label 0 - Not Misinformation**: Accurate claim with no distortion
- **Label 1 - Over-generalization**: Inappropriately extending limited research findings to broader populations or situations beyond the validated scope
- **Label 2 - Improper Restriction**: Inappropriately narrowing the applicability of well-established medical evidence to specific populations without scientific justification
- **Label 3 - Effect Exaggeration**: Inappropriately amplifying treatment effects, risk levels, or statistical significance beyond what the evidence supports
- **Label 4 - Spurious Causation**: Incorrectly interpreting correlation or temporal association as a causal relationship without sufficient evidence

## Dataset Statistics

### Overall Distribution

- Total samples: 72,275
- English claims: 65,127 (90.1%)
- Chinese claims: 7,148 (9.9%)
- Training samples: 43,363 (60%)
- Development samples: 7,229 (10%)
- Test samples: 21,683 (30%)

### Label Distribution

- Not Misinformation (0): 14,455 samples (20%)
- Over-generalization (1): 14,455 samples (20%)
- Improper Restriction (2): 14,455 samples (20%)
- Effect Exaggeration (3): 14,455 samples (20%)
- Spurious Causation (4): 14,455 samples (20%)

### Language Distribution by Split

- Train: 39,195 English + 4,168 Chinese
- Dev: 6,532 English + 697 Chinese
- Test: 19,400 English + 2,283 Chinese

## Data Collection

### Source Materials

- **Health News Articles**: 16,547 articles from authoritative medical journalism platforms. English sources comprise 15,108 articles from Reuters Health, CNN Health, MedPage Today, Medical News Today, and STAT News (2018-2025); Chinese sources comprise 1,439 articles from Chinese medical news platforms and health websites (2017-2023).
- **Scientific Evidence**: Peer-reviewed publications from PubMed-indexed journals. Titles and abstracts were retrieved via the PubMed API, and each claim was verified against the original biomedical literature.

## Recommended Tasks

### Primary Task

- **5-class Fine-grained Medical Misinformation Detection**: Classify health claims into one of five categories (accurate, or one of four distortion types)

### Alternative Tasks

- **Binary Classification**: Distinguish accurate claims from any type of medical distortion
- **4-class Classification**: Categorize distorted claims by clinical risk type (excluding accurate claims)
- **Cross-lingual Transfer**: Evaluate model robustness across English and Chinese medical claims

## Version History

- v1.0 (2025): Initial release with 72,275 samples across English and Chinese

## Ethical Considerations

This dataset is intended for research purposes, to improve the automated detection of medical misinformation.

- Users should not use this dataset to deliberately generate or spread medical misinformation.
- Users should consider the potential impact of false positives in clinical decision-making contexts.
- Users should be aware of the limitations of automated systems in replacing medical expertise.
- Users should respect patient privacy and confidentiality when developing applications using this dataset.
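The JSON format described above can be loaded and sanity-checked with a few lines of Python. The sketch below is illustrative only; the `load_split`, `validate`, and `label_counts` helpers are not shipped with the dataset:

```python
import json
from collections import Counter

REQUIRED_FIELDS = {"id", "claim", "document", "class_label", "language"}

def load_split(path):
    """Load one HDCR split: a JSON file holding a list of samples."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    validate(samples)
    return samples

def validate(samples):
    """Check each record against the documented schema."""
    for s in samples:
        assert REQUIRED_FIELDS <= s.keys(), f"missing fields in {s.get('id')}"
        assert s["class_label"] in range(5)
        assert s["language"] in ("en", "cn")

def label_counts(samples):
    """Distribution of distortion labels in a list of samples."""
    return Counter(s["class_label"] for s in samples)

# Demo on the sample record shown above; in practice, call
# load_split("train.json") and expect 43,363 records.
demo = [{"id": "0039959",
         "claim": "Workplace health promotion program delivers significant benefits",
         "document": "What Can You Achieve in 8 Years? ...",
         "class_label": 3,
         "language": "en"}]
validate(demo)
print(label_counts(demo))  # Counter({3: 1})
```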
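The binary and 4-class task variants listed under Recommended Tasks can be derived from the 5-class labels without re-annotation. A minimal sketch, assuming the label scheme above (the helper names are illustrative):

```python
def to_binary(class_label):
    """Binary task: 0 = accurate, 1 = any distortion type."""
    return 0 if class_label == 0 else 1

def to_four_class(class_label):
    """4-class task over distorted claims only: labels 1-4 map to 0-3.
    Accurate claims (label 0) are excluded from this task."""
    if class_label == 0:
        raise ValueError("label 0 (Not Misinformation) is excluded")
    return class_label - 1

samples = [{"class_label": 0}, {"class_label": 3}, {"class_label": 4}]
print([to_binary(s["class_label"]) for s in samples])        # [0, 1, 1]
distorted = [s for s in samples if s["class_label"] != 0]
print([to_four_class(s["class_label"]) for s in distorted])  # [2, 3]
```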
## How to Cite

If you use this dataset in your research, please cite both the dataset and the associated paper.

**Dataset Citation:**

Zuo, C., & Banerjee, R. (2025). HDCR: Cross-lingual Medical Misinformation Detection Dataset [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.17486207

BibTeX:

```bibtex
@dataset{biomedicalexpert2025,
  author    = {Zuo, Chaoyuan and Banerjee, Ritwik},
  title     = {{HDCR: Cross-lingual Medical Misinformation Detection Dataset}},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17486207},
  url       = {https://doi.org/10.5281/zenodo.17486207}
}
```

**Paper Citation:**

Zuo, C., Wang, C., & Banerjee, R. (2025). HDCR: Cross-lingual Medical Misinformation Detection through Contrastive Claim-Evidence Reasoning. IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

BibTeX:

```bibtex
@inproceedings{healthclaimexperts2025paper,
  author    = {Zuo, Chaoyuan and Wang, Chenlu and Banerjee, Ritwik},
  title     = {{HDCR: Cross-lingual Medical Misinformation Detection through Contrastive Claim-Evidence Reasoning}},
  booktitle = {IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
  year      = {2025}
}
```
