
This study presents a scalable and efficient approach to anomaly detection in network traffic using Azure Databricks and machine learning. Modern networks generate massive volumes of traffic data, which makes manual detection of anomalies and cyber threats impractical. Traditional tools such as relational databases (RDBMS) and Hadoop are slow at this scale and were not designed for real-time security monitoring. To address these challenges, the proposed system uses Azure Databricks, a unified cloud platform for big data processing and machine learning. Network traffic logs were cleaned and transformed with PySpark to extract features such as IP addresses, session duration, data transfer volume, and packet counts. K-means clustering was then applied to group similar traffic patterns and identify anomalies without the need for labeled data. Model performance was evaluated using the Silhouette Score to confirm that the clusters were meaningful and well separated. The study also aims to provide a comprehensive overview of recent advancements in anomaly detection, focusing on emerging technologies and potential future opportunities. All stages, from data ingestion to anomaly detection, were executed within a single Databricks notebook, so setup is minimal. The system performs efficiently even on low-cost Azure plans, making it accessible to small teams, students, and researchers. This solution enables real-time threat detection, automatic scaling, and quick incident response, offering a faster and more cost-effective alternative to traditional network security methods.
Network Traffic, K-Means Clustering, Anomaly Detection, Azure Databricks, Silhouette Score
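
The clustering and evaluation steps described in the abstract can be illustrated with a small sketch. The study itself used PySpark on Azure Databricks; the version below is plain standard-library Python so that it runs without a cluster. The sample rows, the function names, the first-k initialization, and the distance-threshold rule for flagging anomalies are all assumptions made for illustration and are not taken from the study.

```python
import math
import statistics


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iters=100):
    """Lloyd's k-means. Assumption: centroids start at the first k points."""
    centroids = list(points[:k])
    labels = []
    for _ in range(max_iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:  # stop once the centroids no longer move
            break
        centroids = updated
    return centroids, labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def flag_anomalies(points, centroids, labels, n_std=2.0):
    """Assumed rule: flag points unusually far from their own centroid."""
    d = [dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    threshold = statistics.mean(d) + n_std * statistics.pstdev(d)
    return [p for p, di in zip(points, d) if di > threshold]


# Hypothetical rows: (session duration, packet count), already scaled.
rows = [(1, 1), (9, 9), (1, 2), (2, 1), (2, 2),
        (8, 8), (8, 9), (9, 8), (20, 25)]

centroids, labels = kmeans(rows, k=2)
print("silhouette:", round(silhouette_score(rows, labels), 3))
print("flagged:", flag_anomalies(rows, centroids, labels))
```

On Databricks the same flow would more likely go through Spark MLlib's `KMeans` and `ClusteringEvaluator`, which distribute the work across the cluster; the sketch only shows the logic on a handful of rows.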
