Predicting Failures of Autoscaling Distributed Applications

Publication: Article, 12 Jul 2024, Italy, English

Publisher: Association for Computing Machinery (ACM)

Journal: Proceedings of the ACM on Software Engineering, volume 1, pages 1,960-1,981 (eISSN: 2994-970X)

Authors: Giovanni Denaro; Noura El Moussa; Rahim Heydarov; Francesco Lomio; Mauro Pezzè; Ketai Qiu

doi: 10.1145/3660794

handle: 10281/550343



Abstract

Predicting failures in production environments allows service providers to activate countermeasures that prevent harming the users of the applications. The most successful approaches predict failures from error states that the current approaches identify from anomalies in time series of fixed sets of KPI values collected at runtime. They cannot handle time series of KPI sets with size that varies over time. Thus these approaches work with applications that run on statically configured sets of components and computational nodes, and do not scale up to the many popular cloud applications that exploit autoscaling. This paper proposes Preface, a novel approach to predict failures in cloud applications that exploit autoscaling. Preface originally augments the neural-network-based failure predictors successfully exploited to predict failures in statically configured applications, with a Rectifier layer that handles KPI sets of highly variable size as the ones collected in cloud autoscaling applications, and reduces those KPIs to a set of rectified-KPIs of fixed size that can be fed to the neural-network predictor. The Preface Rectifier computes the rectified-KPIs as descriptive statistics of the original KPIs, for each logical component of the target application. The descriptive statistics shrink the highly variable sets of KPIs collected at different timestamps to a fixed set of values compatible with the input nodes of the neural-network failure predictor. The neural network can then reveal anomalies that correspond to error states, before they propagate to failures that harm the users of the applications. The experiments on both a commercial application and a widely used academic exemplar confirm that Preface can indeed predict many harmful failures early enough to activate proper countermeasures.
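
The rectification step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rectify`, the tuple layout of the samples, the choice of statistics (min, max, mean, population standard deviation), and the zero fill for components with no running replica are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how per-component descriptive statistics yield a vector of fixed length regardless of how many replicas report KPIs at a given timestamp.

```python
from collections import defaultdict
from statistics import fmean, pstdev

# Hypothetical choice of statistics; the abstract does not list the ones used.
STATS = {
    "min": min,
    "max": max,
    "mean": fmean,
    "std": pstdev,
}


def rectify(samples, components, kpi_names):
    """Reduce a variable-size set of KPI samples to a fixed-size vector.

    samples: iterable of (component, replica_id, kpi_name, value) tuples
             collected at one timestamp; the number of replicas may vary.
    components, kpi_names: fixed, ordered lists that define the output layout.
    """
    # Group the values of all replicas by (logical component, KPI name).
    grouped = defaultdict(list)
    for component, _replica, kpi, value in samples:
        grouped[(component, kpi)].append(value)

    vector = []
    for component in components:
        for kpi in kpi_names:
            values = grouped.get((component, kpi), [])
            for stat in STATS.values():
                # Assumption: a component with no running replica yields 0.0.
                vector.append(float(stat(values)) if values else 0.0)
    return vector


# Two timestamps with a different number of "cart" replicas (made-up values).
t1 = [("cart", "cart-0", "cpu", 0.2), ("cart", "cart-1", "cpu", 0.6)]
t2 = t1 + [("cart", "cart-2", "cpu", 0.7)]
v1 = rectify(t1, ["cart"], ["cpu"])
v2 = rectify(t2, ["cart"], ["cpu"])
```

Both `v1` and `v2` have `len(components) * len(kpi_names) * len(STATS)` entries, so either could be fed to a predictor with a fixed number of input nodes even though the second timestamp has one more replica.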

Country

Italy

Related Organizations

University of Milano-Bicocca
Italy
Universita della Svizzera Italiana
Switzerland

Keywords

Failure Prediction, Fault Localization, Kubernetes

Impact by BIP!

- Selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access routes: Green, gold