Spam Images in Messaging - Annotated Set (SIMAS)

SIMAS Dataset

This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.

Taxonomy for MMS Visual Spam

The following table presents the definitions of categories used for classifying MMS images.

Table 1: Category definitions

Category    Description
Alcohol*    Content related to alcoholic beverages, including advertisements and consumption.
Drugs*      Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine).
Firearms*   Content involving guns, pistols, knives, or military weapons.
Gambling*   Content related to gambling (casinos, poker, roulette, lotteries).
Sexual      Content involving nudity, sexual acts, or sexually suggestive material.
Tobacco*    Content related to tobacco use and advertisements.
Violence    Content showing violent acts, self-harm, or injury.
Safe        All other content, including neutral depictions, products, or harmless cultural symbols.

Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.

Dataset Collection and Annotation

Data Sources

The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.

The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review.
The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.

Another 25.1% of images were sourced from Roboflow, using open datasets such as: Marijuana and Hemp 200 Drug Detection Plants Classification Weapon Detection Suicide Detection Violence Detection Clasificacionimagenes Waste Recognition FYP

The NudeNet dataset contributes 11.4% of the dataset. We sampled 1,000 images from the “porn” category to provide visual coverage of explicit sexual content.

Another 11.0% of images were collected from Kaggle, including: National Flowers Weapon Dataset for YOLOv5 GUIE Toys Alcohol Bottle Images Smoking & Drinking Dataset

An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.

Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.

Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.

All images collected from the listed sources have been manually reviewed by three independent annotators. Each image is then assigned to a category when at least two annotators reach consensus.
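The two-of-three agreement rule described above can be illustrated with a short sketch. This is not the authors' annotation tooling: the function name consensus_label, the min_agreement parameter, and the choice to return None when no label reaches agreement are all assumptions made for illustration, since the description does not say how images without agreement were handled.

```python
from collections import Counter


def consensus_label(votes, min_agreement=2):
    """Return the label shared by at least `min_agreement` annotators.

    Hypothetical helper: returns None when no label reaches the
    threshold (how such cases were resolved is not stated in the source).
    """
    if not votes:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Three annotators, two of whom agree on "Tobacco".
agreed = consensus_label(["Tobacco", "Tobacco", "Safe"])

# Three annotators with three different labels: no consensus.
unresolved = consensus_label(["Alcohol", "Drugs", "Safe"])
```

With three annotators and a threshold of two, at most one label can reach agreement, so the result is unambiguous whenever it is not None.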
Table 2: Distribution of images per public source and category in SIMAS dataset

Type    Category  LAION  Roboflow  NudeNet  Kaggle  Unsplash  UnsafeBench  Other  Total
Unsafe  Alcohol      29         0        3     267         0            1      0    300
Unsafe  Drugs        17       211        0       0        13            8      1    250
Unsafe  Firearms      0        59        0     229         0           62      0    350
Unsafe  Gambling    132        38        0       0        73           39     18    300
Unsafe  Sexual        2         0      421       0         3           68      6    500
Unsafe  Tobacco       0       446        0       0        43           11      0    500
Unsafe  Violence      0       289        0       0         0           11      0    300
Safe    Alcohol     140        35        0       0        16           13     96    300
Safe    Drugs        67        49        0      15        72           17     30    250
Safe    Firearms    173        15        0       3       144            8      7    350
Safe    Gambling    164         2        0       1       121           12      0    300
Safe    Sexual      235        22      139       2         0           94      8    500
Safe    Tobacco     351        67        5      13         8           16     40    500
Safe    Violence    212        20        3      21         0           42      2    300
All     All       1,522     1,253      571     551       493          402    208  5,000

Balancing

To ensure semantic diversity and dataset balance, undersampling was performed on overrepresented categories using a CLIP-based embedding and k-means clustering strategy. This resulted in a final dataset containing 2,500 spam and 2,500 safe images, evenly distributed across all categories.

Table 3: Distribution of images per category in SIMAS dataset

Type    Alcohol  Drugs  Firearms  Gambling  Sexual  Tobacco  Violence  Total
Unsafe      300    250       350       300     500      500       300  2,500
Safe        300    250       350       300     500      500       300  2,500
All         600    500       700       600   1,000    1,000       600  5,000

SIMAS+ Dataset

For researchers interested in a more realistic deployment setting, we also curate a complementary dataset called SIMAS+. It is a benchmarking dataset containing publicly accessible images extracted from real-world MMS traffic, specifically from external URLs embedded in messages. Manual annotation was conducted by three independent raters, with a category label assigned when at least two annotators agreed. The dataset was then balanced across spam categories using the same semantic grouping strategy as in SIMAS, ensuring equal representation of safe and unsafe examples per class. The final version of SIMAS+ contains 700 images, with the category distribution presented in the table below.
Table 4: Distribution of images per category in SIMAS+ dataset

Type    Alcohol  Drugs  Firearms  Gambling  Sexual  Tobacco  Violence  Total
Unsafe      100     50        80        50      50       10        10    350
Safe        100     50        80        50      50       10        10    350
All         200    100       160       100     100       20        20    700

Note: Due to regulatory and privacy considerations, SIMAS+ is not included in this archive. To obtain access to the SIMAS+ dataset for research purposes, please contact the dataset authors directly.

License

This dataset is licensed under the CC BY-NC 4.0 license and may be used for non-commercial research purposes.
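For readers who want to reproduce something similar to the balancing step described earlier (embedding plus k-means undersampling), the following is a minimal sketch under stated assumptions. The archive does not document the exact procedure, so every detail here is illustrative: embeddings are taken as precomputed vectors (in practice they would come from a CLIP image encoder), the number of clusters is set equal to the target size, centroids are initialised farthest-first, and the image closest to each centroid is kept. The function name undersample_by_clusters is hypothetical.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def undersample_by_clusters(embeddings, target_size, iters=20):
    """Pick `target_size` indices by clustering and keeping one item per cluster.

    Assumption: this is one plausible reading of "CLIP-based embedding and
    k-means clustering", not the dataset authors' actual code.
    """
    n = len(embeddings)
    if target_size <= 0:
        return []
    if target_size >= n:
        return list(range(n))  # nothing to remove

    # Farthest-first initialisation: deterministic and spreads centroids out.
    centers = [list(embeddings[0])]
    while len(centers) < target_size:
        far = max(
            range(n),
            key=lambda i: min(_dist2(embeddings[i], c) for c in centers),
        )
        centers.append(list(embeddings[far]))

    assign = [0] * n
    for _ in range(iters):
        # Assignment step: each item goes to its nearest centroid.
        assign = [
            min(range(target_size), key=lambda k: _dist2(e, centers[k]))
            for e in embeddings
        ]
        # Update step: an empty cluster keeps its previous centroid.
        for k in range(target_size):
            members = [embeddings[i] for i in range(n) if assign[i] == k]
            if members:
                centers[k] = _mean(members)

    # Keep the member closest to each centroid as the cluster representative.
    chosen = []
    for k in range(target_size):
        members = [i for i in range(n) if assign[i] == k]
        if members:
            chosen.append(
                min(members, key=lambda i: _dist2(embeddings[i], centers[k]))
            )

    # Top up in index order if some clusters ended up empty.
    for i in range(n):
        if len(chosen) >= target_size:
            break
        if i not in chosen:
            chosen.append(i)
    return sorted(chosen)
```

Calling undersample_by_clusters(embeddings, 300) on the embeddings of an overrepresented category would return 300 indices to keep. Keeping one item per cluster favours visually distinct images over near-duplicates, which is the usual motivation for clustering before undersampling. For real data, a library implementation of k-means would be a more practical choice than this pure-Python loop.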

Related Organizations

Graz University of Technology
Austria

Keywords

Datasets as Topic/classification
