Fake Job Postings Dataset - Replication Archive for "TabuLLM: Feature Extraction from Tabular Text Data using Large Language Models"

Mahani, Alireza; Taghavi Azar Sharabiani, Mansour; Bottle, Alex

ZENODO

Dataset . 2026

Data sources: Datacite

Research data: Dataset | 06 Mar 2026 | English | Publisher: Zenodo

doi: 10.5281/zenodo.18884002, 10.5281/zenodo.18884001

Abstract

This dataset contains 17,880 job postings labeled as fraudulent (4.84%) or legitimate (95.16%). It is the EMSCAD (Employment Scam Aegean Dataset) originally published by Vidros et al. (2017, Future Internet, doi:10.3390/fi9010006) and distributed via Kaggle (shivamb/real-or-fake-fake-jobposting-prediction).

This archive is provided for reviewers and readers of the manuscript "TabuLLM: Feature Extraction from Tabular Text Data using Large Language Models" (submitted to the Journal of Statistical Software), as a convenience mirror to support offline replication without requiring a Kaggle account.

File: fake_job_postings.csv (17,880 rows × 18 columns). Columns include 7 free-text fields (title, location, department, company_profile, description, requirements, benefits), 3 binary indicators, 5 categorical features, 1 numeric identifier (job_id), 1 sparse field (salary_range, 84% missing), and 1 binary target (fraudulent).
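The counts given in the abstract can be checked against a downloaded copy of the file. The following is a minimal sketch, not part of the archive itself: the helper name `summarize_postings` is hypothetical, and it assumes the file is UTF-8 encoded, has a header row, and encodes the `fraudulent` target as the strings `0` and `1`. It uses only the Python standard library.

```python
import csv
from collections import Counter


def summarize_postings(csv_path, target="fraudulent"):
    """Return (n_rows, n_columns, share of rows whose target equals '1').

    Assumption: the target column holds the strings '0' and '1'.
    """
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which free-text columns such as description are likely to contain.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        labels = Counter(row[target] for row in reader)
    n_rows = sum(labels.values())
    fraud_share = labels.get("1", 0) / n_rows if n_rows else 0.0
    return n_rows, len(columns), fraud_share
```

Calling `summarize_postings("fake_job_postings.csv")` on a copy that matches the description above should yield 17,880 rows, 18 columns, and a fraudulent share of about 0.0484. Different values would suggest a different file version or a different label encoding.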

Related Organizations

Imperial College London
United Kingdom

Keywords

replication, fraud detection, text classification, job postings, binary classification, tabular data, NLP
