ZENODO
Dataset . 2024
License: CC BY
Data sources: Datacite

Network Digital Twin-Generated Dataset for Machine Learning-based Detection of Benign and Malicious Heavy Hitter Flows

Authors: Karamchandani Batra, Amit; Nuñez Fuente, Javier; de la Cal García, Luis; Moreno Meneses, Yenny; Mozo Velasco, Alberto; Pastor Perales, Antonio; R. López, Diego;

Abstract

Overview

This record provides a dataset created as part of the study presented in the publication cited below, and it is made publicly available for research purposes. The associated article gives a comprehensive description of the dataset, its structure, and the methodology used to create it. If you use this dataset, please cite the following article:

A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.

The article was published in IEEE Communications Magazine. According to the 2024 Journal Citation Reports, the journal has a five-year Impact Factor of 10 and is ranked 12th out of 120 journals in the Telecommunications category, which places it in the 90.83rd percentile.

The record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter (HH) flows within a realistic virtualized network environment. HH flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate HH activity from malicious Distributed Denial-of-Service (DDoS) traffic is critical for network management and security, yet existing datasets lack the granularity needed to train machine learning models to make this distinction effectively. To address this, a Network Digital Twin (NDT) approach was used to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.

Feature Set

The feature set includes the following flow statistics, which are commonly used in the literature on network traffic classification:

  • The protocol used for the connection, identifying whether it is TCP, UDP, ICMP, or OSPF.
  • The time (relative to the connection start) of the most recent packet sent from source to destination at the time of each snapshot.
  • The time (relative to the connection start) of the most recent packet sent from destination to source at the time of each snapshot.
  • The cumulative count of data packets sent from source to destination at the time of each snapshot.
  • The cumulative count of data packets sent from destination to source at the time of each snapshot.
  • The cumulative bytes sent from source to destination at the time of each snapshot.
  • The cumulative bytes sent from destination to source at the time of each snapshot.
  • The time difference between the first packet sent from source to destination and the first packet sent from destination to source.

Dataset Variations

To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:

  • All at Once: A synthetic dataset in which all traffic types (benign HH, normal, and malicious DDoS HH flows) are combined into a single dataset. This version gives a holistic view of the traffic environment, simulating real-world scenarios where all traffic occurs simultaneously.
  • Balanced Traffic Generation: A dataset with equal proportions of benign, normal, and malicious DDoS traffic. Designed for scenarios where a balanced dataset is needed for fair training and evaluation of machine learning models.
  • DDoS at Intervals: Traffic data in which malicious DDoS HH traffic occurs at specific time intervals, mimicking real-world attack patterns. Useful for studying the impact and detection of intermittent malicious activity.
  • Only Benign HH Traffic: Only benign HH traffic flows. Suitable for training and evaluating models that identify and differentiate benign heavy hitter traffic patterns.
  • Only DDoS Traffic: Only malicious DDoS HH traffic. Helps isolate and analyze attack characteristics for targeted threat detection.
  • Only Normal Traffic: Only regular, non-HH traffic flows. Useful for understanding baseline network behavior in the absence of heavy hitters.
  • Unbalanced Traffic Generation: A dataset with varying proportions of benign, normal, and malicious traffic. Simulates real-world scenarios where certain types of traffic dominate, providing insight into model performance under unbalanced conditions.

For each variation, the output of each packet aggregator is provided separately in its own folder. Each variation was generated using the NDT approach to demonstrate its flexibility and to ensure the reproducibility of the study's experiments, while also contributing to future research on network traffic patterns and on the detection and classification of heavy hitter flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
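
The per-snapshot flow statistics in the feature set can be illustrated with a minimal Python sketch. It is not the packet aggregator used to build the dataset: the `Packet` and `FlowSnapshot` types, their field names, and the `snapshot` function are hypothetical, the connection start is assumed to be the timestamp of the first observed packet, and every packet is counted as a data packet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    """One observed packet of a flow (hypothetical input format)."""
    timestamp: float  # absolute capture time, in seconds
    size: int         # packet size, in bytes
    forward: bool     # True: source -> destination; False: destination -> source


@dataclass
class FlowSnapshot:
    """The eight flow statistics listed in the feature set (field names are assumptions)."""
    protocol: str
    last_fwd_time: Optional[float]      # most recent src -> dst packet, relative to start
    last_bwd_time: Optional[float]      # most recent dst -> src packet, relative to start
    fwd_packets: int                    # cumulative src -> dst packet count
    bwd_packets: int                    # cumulative dst -> src packet count
    fwd_bytes: int                      # cumulative src -> dst bytes
    bwd_bytes: int                      # cumulative dst -> src bytes
    first_reply_delay: Optional[float]  # first dst -> src time minus first src -> dst time


def snapshot(protocol: str, packets: List[Packet], at: float) -> FlowSnapshot:
    """Compute cumulative statistics over all packets seen up to time `at`."""
    seen = sorted((p for p in packets if p.timestamp <= at), key=lambda p: p.timestamp)
    if not seen:
        raise ValueError("no packets observed before the snapshot time")
    start = seen[0].timestamp  # assumption: the connection starts at the first packet
    fwd = [p for p in seen if p.forward]
    bwd = [p for p in seen if not p.forward]
    return FlowSnapshot(
        protocol=protocol,
        last_fwd_time=fwd[-1].timestamp - start if fwd else None,
        last_bwd_time=bwd[-1].timestamp - start if bwd else None,
        fwd_packets=len(fwd),
        bwd_packets=len(bwd),
        fwd_bytes=sum(p.size for p in fwd),
        bwd_bytes=sum(p.size for p in bwd),
        first_reply_delay=(bwd[0].timestamp - fwd[0].timestamp) if fwd and bwd else None,
    )


# Toy flow: three packets from the source, one reply from the destination.
packets = [
    Packet(10.0, 100, True),
    Packet(10.25, 60, False),
    Packet(10.5, 1500, True),
    Packet(11.0, 1500, True),
]
snap = snapshot("TCP", packets, at=10.75)  # the packet at t=11.0 is not yet visible
print(snap)
```

Because the counters are cumulative, taking snapshots of the same flow at increasing times yields non-decreasing packet and byte counts, so the rows for one flow trace its growth over time.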

Keywords

Machine Learning, Cybersecurity, Heavy Hitter Flows, Artificial Intelligence, Network Digital Twin, Distributed Denial of Service Attacks, Network Traffic Classification, Telecommunication Networks

  • Impact by BIP!
    citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.