BanglaPhish-2026: A Synthetic Bangla Phishing and Scam Detection Benchmark for Cybersecurity NLP

We introduce BanglaPhish-2026, a large-scale, class-balanced, synthetic Bangla-language benchmark dataset for phishing and scam message detection. The dataset contains 6,000 text samples spanning 30 real-world-inspired domains (3,000 scam messages and 3,000 legitimate messages), covering mobile financial services fraud, OTP theft, fake job offers, lottery scams, government impersonation, bank verification scams, and corresponding authentic notification counterparts. All records are synthetically generated and redacted; no real personal data, credentials, phone numbers, OTPs, or live malicious URLs are present. The dataset is partitioned into standardised train (4,200), validation (900), and test (900) splits, with human quality review on a stratified 30% sample (Cohen's κ = 0.94, Fleiss' κ = 0.93). Three classical character n-gram TF-IDF baselines (Logistic Regression, Linear SVM, Multinomial Naive Bayes) are evaluated under five settings: a random split, a template-disjoint split, a domain-held-out split (24 train domains / 6 unseen test domains), an adversarial-hard split (heavy transliteration, zero-width injection, mimicry; 390 records), and a 200-record real-world-style supplement with organic-text perturbations. Classical models saturate at macro-F1 of 1.000 under random and template-disjoint splits and reach LR macro-F1 = 0.995 on the real-world supplement; the adversarial-hard probe drops the strongest classical model to macro-F1 = 0.794 (−20.6 points).
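To make the domain-held-out setting and the macro-F1 metric concrete, the following is a minimal Python sketch, not the released evaluation code. The record fields (`domain`, `label`, `text`), the toy records, and the random choice of held-out domains are all assumptions made for illustration; the abstract does not state how the 6 unseen test domains were selected.

```python
import random

def domain_held_out_split(records, n_test_domains=6, seed=0):
    """Split so that no test domain ever appears in the training set.

    Assumption: each record is a dict with a hypothetical "domain" key.
    """
    domains = sorted({r["domain"] for r in records})
    rng = random.Random(seed)
    test_domains = set(rng.sample(domains, n_test_domains))
    train = [r for r in records if r["domain"] not in test_domains]
    test = [r for r in records if r["domain"] in test_domains]
    return train, test, test_domains

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy stand-in for the benchmark: 30 hypothetical domains, one scam and
# one legitimate placeholder record per domain (not real dataset rows).
records = [
    {"domain": f"domain_{i:02d}", "label": label, "text": "placeholder"}
    for i in range(30)
    for label in ("scam", "legit")
]

train, test, held_out = domain_held_out_split(records)
train_domains = {r["domain"] for r in train}
print(len(train_domains), "train domains /", len(held_out), "unseen test domains")
```

Splitting by domain rather than by record is what separates this setting from the random split: a model that only memorises domain-specific wording gets no credit on the unseen domains.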
We additionally report two transformer baselines: (a) a CPU-feasible frozen multilingual MiniLM (XLM-R-distilled) with a logistic regression head, which is weaker than classical TF-IDF across all five settings (macro-F1 0.979 / 0.979 / 0.964 / 0.416 / 0.831 for Settings 1–5), evidence that the dataset’s primary signal is surface character regularity rather than deep semantics; and (b) genuine end-to-end fine-tuning of DistilmBERT (, CPU-only), which achieves macro-F1 = 0.9989 / AUC = 0.9999 on Setting 1 and macro-F1 = 1.000 on the template-disjoint Setting 2, confirming that fine-tuning closes the frozen encoder gap. Fine-tuning scripts for BanglaBERT, XLM-RoBERTa, mBERT, and DistilmBERT are also released as community benchmarks. We position BanglaPhish-2026 explicitly as a controlled synthetic benchmark for probing template memorisation, domain generalisation, and adversarial robustness in Bangla cybersecurity NLP, not as a substitute for in-the-wild organic phishing evaluation. BanglaPhish-2026 addresses the critical scarcity of publicly available Bangla cybersecurity corpora and is released under CC BY-NC 4.0.
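The reliance on surface character regularity, and the reason a zero-width injection probe can hurt character n-gram models, can be illustrated with a small Python sketch. This is an assumption-laden toy, not the authors' perturbation procedure: the sample message, the injection rate (`every=2`), and the trigram size are all hypothetical choices.

```python
ZERO_WIDTH_SPACE = "\u200b"  # invisible code point used for the injection

def char_ngrams(text, n=3):
    """Set of character n-grams, the basic unit behind char-level TF-IDF."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def inject_zero_width(text, every=2):
    """Insert a zero-width space after every `every` characters (hypothetical rate)."""
    out = []
    for i, ch in enumerate(text, start=1):
        out.append(ch)
        if i % every == 0 and i < len(text):
            out.append(ZERO_WIDTH_SPACE)
    return "".join(out)

def strip_zero_width_space(text):
    """Remove U+200B only; ZWNJ/ZWJ are left alone because Bangla
    orthography can use them legitimately."""
    return text.replace(ZERO_WIDTH_SPACE, "")

def jaccard(a, b):
    """Jaccard overlap between two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical transliterated scam-style message, not taken from the dataset.
original = "apnar account verify korun"
perturbed = inject_zero_width(original)

overlap = jaccard(char_ngrams(original), char_ngrams(perturbed))
print(f"trigram overlap after injection: {overlap:.2f}")

restored = strip_zero_width_space(perturbed)
print("restored equals original:", restored == original)
```

With an injection after every second character, every trigram of the perturbed string contains a zero-width space, so none of the original trigrams survive. A character n-gram model therefore sees almost entirely unfamiliar features. Stripping U+200B before feature extraction is one possible mitigation, shown here only as a sketch.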
