
This dataset is a resource for research on webpage classification. It contains 1,069,715 URLs, each annotated with one of three labels (Malicious, Benign, or Adult) and with one of 20 finer-grained sublabels. It is designed for evaluating and benchmarking machine learning models, notably Stochastic Gradient Descent (SGD) classifiers and Support Vector Classifiers (SVC), across a variety of tokenization methods and input types: URLs, raw HTML, and parsed HTML content. The primary objective is to support research into effective webpage classification, and thereby to improve content prioritization and filtering in web crawling applications. The dataset has been curated to provide a basis for studying how different feature representation techniques affect classification accuracy.

The dataset is structured as JSON Lines (jsonl) files, with each entry giving a URL's label, sublabel, source, status code, and HTML content. Because of size constraints on Zenodo, it is divided into three parts, each covering specific content categories. Part 1: Adult & Malicious contains the URLs classified under the Adult and Malicious categories, that is, content that requires stringent filtering. Part 2: Benign 1 and Part 3: Benign 2 contain the benign URLs, supporting the study of safe web content and how it is classified.
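As a minimal sketch of how the JSON Lines files might be read, the snippet below streams a file one record at a time, which avoids loading all of the HTML content into memory, and tallies the label and sublabel pairs. The key names (`label`, `sublabel`) follow the JSON line format given in this description; the function names and the file name in the closing comment are placeholders chosen for illustration, not part of the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_labels(records: Iterable[dict]) -> Counter:
    """Count (label, sublabel) pairs across the given records.

    Assumes each record carries "label" and "sublabel" keys, as in the
    documented line format; missing keys are counted as None.
    """
    return Counter((rec.get("label"), rec.get("sublabel")) for rec in records)


# Example use (the file name is a placeholder, substitute the real one):
#   counts = count_labels(iter_records("benign_2.jsonl"))
#   for pair, n in counts.most_common(5):
#       print(pair, n)
```

Streaming line by line is a deliberate choice here: each record embeds a full HTML page, so reading a whole part into memory at once may not be practical on a typical machine.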
We also provide a .csv file without the HTML content, which makes it easier to work with the URLs only. This .csv file contains the following columns: `['uid', 'url', 'label', 'sublabel']`.

By providing this dataset, we aim to contribute to the field of webpage classification and to offer researchers and practitioners a resource for advancing web crawling technology and its applications.

JSON line format for each line: {"url": "", "label": "", "sublabel": "", "source": "", "status_code": , "html": ""}

Other parts of this dataset:
A Comprehensive Dataset for Webpage Classification (Part 1: Adult & Malicious)
A Comprehensive Dataset for Webpage Classification (Part 2: Benign 1)

Citation

If you use this dataset, please cite us:

Al-Maamari, M., Istaiti, M., Zerhoudi, S., Dinzinger, M., Granitzer, M. and Mitrovic, J., A Comprehensive Dataset for Webpage Classification. https://ca-roll.github.io/downloads/A_Comprehensive_Dataset_for_Webpage_Classification.pdf

Granitzer, M., Voigt, S., Fathima, N.A., Golasowski, M., Guetl, C., Hecking, T., Hendriksen, G., Hiemstra, D., Martinovič, J., Mitrović, J. and Mlakar, I., 2023. Impact and development of an Open Web Index for open web search. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24818
