
Network Digital Twin-Generated Dataset for Machine Learning-based Detection of Benign and Malicious Heavy Hitter Flows

Keywords: Machine Learning, Cybersecurity, Heavy Hitter Flows, Artificial Intelligence, Network Digital Twin, Distributed Denial of Service Attacks, Network Traffic Classification, Telecommunication Networks

Research data > Dataset | 11 Jul 2024 | Publisher: Zenodo

Authors: Karamchandani Batra, Amit; Nuñez Fuente, Javier; de la Cal García, Luis; Moreno Meneses, Yenny; Mozo Velasco, Alberto; Pastor Perales, Antonio; R. López, Diego

DOI: 10.5281/zenodo.14134645, 10.5281/zenodo.14134646, 10.5281/zenodo.14841650

Abstract

Overview

This record provides a dataset created as part of the study presented in the following publication and is made publicly available for research purposes. The associated article provides a comprehensive description of the dataset, its structure, and the methodology used in its creation. If you use this dataset, please cite the following article:

A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.

This article was published in IEEE Communications Magazine, one of the most prestigious and influential journals in both academic and industrial circles. According to the 2024 Journal Citation Reports, it has a five-year Impact Factor of 10 and is ranked 12th out of 120 journals in the Telecommunications category, placing it in the 90.83rd percentile.

More precisely, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter (HH) flows within a realistic virtualized network environment. HH flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate HH activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models that make this distinction effectively. To address this, a Network Digital Twin (NDT) approach was used to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

Feature Set

The feature set includes the following flow statistics commonly used in the literature on network traffic classification:

- The protocol used for the connection, identifying whether it is TCP, UDP, ICMP, or OSPF.
- The time (relative to the connection start) of the most recent packet sent from source to destination at the time of each snapshot.
- The time (relative to the connection start) of the most recent packet sent from destination to source at the time of each snapshot.
- The cumulative count of data packets sent from source to destination at the time of each snapshot.
- The cumulative count of data packets sent from destination to source at the time of each snapshot.
- The cumulative bytes sent from source to destination at the time of each snapshot.
- The cumulative bytes sent from destination to source at the time of each snapshot.
- The time difference between the first packet sent from source to destination and the first packet sent from destination to source.

Dataset Variations

To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:

- All at Once: A synthetic dataset in which all traffic types, including benign heavy hitter (HH), normal, and malicious DDoS HH flows, are combined into a single dataset. This version gives a holistic view of the traffic environment, simulating real-world scenarios where all traffic occurs simultaneously.
- Balanced Traffic Generation: A balanced dataset with an equal proportion of benign, normal, and malicious DDoS traffic. Designed for scenarios where a balanced dataset is needed for fair training and evaluation of machine learning models.
- DDoS at Intervals: Traffic data in which malicious DDoS HH traffic occurs at specific time intervals, mimicking real-world attack patterns. Useful for studying the impact and detection of intermittent malicious activity.
- Only Benign HH Traffic: Only benign HH traffic flows. Suitable for training and evaluating models that identify and differentiate benign heavy hitter traffic patterns.
- Only DDoS Traffic: Only malicious DDoS HH traffic. Helps isolate and analyze attack characteristics for targeted threat detection.
- Only Normal Traffic: Only regular, non-HH traffic flows. Useful for understanding baseline network behavior in the absence of heavy hitters.
- Unbalanced Traffic Generation: An unbalanced dataset with varying proportions of benign, normal, and malicious traffic. Simulates real-world scenarios where certain types of traffic dominate, providing insight into model performance under unbalanced conditions.

For each variation, the output of each packet aggregator is provided in its own folder.

Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
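
As an illustration of the flow statistics listed under Feature Set, the following minimal Python sketch computes one snapshot of those eight values from a list of packets. It is not the packet aggregator used by the authors. The `Packet` record, the `flow_snapshot` function, and the dictionary keys are hypothetical names chosen for this example and do not reflect the dataset's actual column names. The sketch also assumes that every packet counts as a data packet and that the connection start is the timestamp of the first observed packet.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Packet:
    """Hypothetical packet record (an assumption, not the dataset's schema)."""
    timestamp: float  # absolute capture time, in seconds
    forward: bool     # True: source -> destination, False: destination -> source
    size: int         # packet size in bytes


def _last_relative(packets: List[Packet], start: float) -> Optional[float]:
    """Time of the most recent packet, relative to the connection start."""
    if not packets:
        return None
    return max(p.timestamp for p in packets) - start


def flow_snapshot(
    protocol: str, packets: List[Packet], snapshot_time: float
) -> Dict[str, Union[str, int, float, None]]:
    """Compute the eight flow statistics using packets seen up to snapshot_time."""
    seen = [p for p in packets if p.timestamp <= snapshot_time]
    if not seen:
        raise ValueError("no packets observed before the snapshot time")

    # Assumption: the connection starts at the first observed packet.
    start = min(p.timestamp for p in seen)
    fwd = [p for p in seen if p.forward]
    bwd = [p for p in seen if not p.forward]

    # Assumption: the delay is (first reverse packet) - (first forward packet).
    first_reply_delay = None
    if fwd and bwd:
        first_reply_delay = (
            min(p.timestamp for p in bwd) - min(p.timestamp for p in fwd)
        )

    return {
        "protocol": protocol,
        "last_fwd_time": _last_relative(fwd, start),
        "last_bwd_time": _last_relative(bwd, start),
        "fwd_packets": len(fwd),  # assumption: all packets count as data packets
        "bwd_packets": len(bwd),
        "fwd_bytes": sum(p.size for p in fwd),
        "bwd_bytes": sum(p.size for p in bwd),
        "first_reply_delay": first_reply_delay,
    }
```

Calling `flow_snapshot` repeatedly with increasing `snapshot_time` values on the same packet list yields a sequence of cumulative snapshots for one flow, which matches the "at the time of each snapshot" wording in the feature descriptions.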

Related Organizations

Universidad Politécnica de Madrid
Spain

Impact by BIP!

- Citations: 0. An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Related to Research communities

Knowmad Institut