
The growing complexity of present-day cloud environments, together with dense service dependencies and dynamic scaling requirements, makes rapid anomaly detection and root cause analysis (RCA) both critical and difficult. Traditional rule-based monitoring frameworks cannot detect the subtle and novel anomalies that precede system outages or security breaches. This paper examines how artificial intelligence (AI), machine learning (ML), and deep learning applied to logs, metrics, and traces can automate anomaly detection and RCA in cloud-native platforms. It covers the use of supervised, unsupervised, and reinforcement learning methods on diverse telemetry data for real-time detection of performance degradation, system errors, and anomalous usage patterns. AI-driven systems can correlate incidents across distributed systems, pinpoint underlying causes, and recommend remediations faster than human operators can. The operational impact of these techniques is illustrated through real-world applications at Adobe, Uber, Zalando, and LinkedIn. The paper also details the ethical and technical challenges facing automated RCA systems, including model drift, the interpretability of complex models, and observability gaps. As cloud systems continue to expand, AI-driven anomaly detection becomes essential for maintaining resilience, optimizing performance, and strengthening cyber defense in multi-cloud and hybrid cloud environments.
Observability, Cloud resilience, Traces, Security threats, Machine learning, Root cause analysis, Deep learning, Metrics, Anomaly detection, AI operations, Cloud monitoring, Logs
