Anomaly Detection in the Elasticsearch Service

The Elasticsearch Service is a distributed search and analytics engine widely used across CERN. Currently, service managers detect issues through internal monitoring and resolve them manually. However, the number of clusters and metrics is large, which makes them difficult to track, and issues are often discovered and reported by users instead. This is time-consuming and disrupts the workflow of the service users. The main objective of this project is therefore to develop a model capable of identifying anomalies in the Elasticsearch Service clusters, so that service issues can be predicted and eliminated before they cause problems. This is done by analyzing the history of cluster data using machine learning methods. In this way, a single metric signaling service issues can be obtained and used to alert service managers to upcoming issues. In 2017, a deep neural network model was developed for this purpose. However, several issues were identified with that model, the most severe being convergence problems in the autoencoder. In this project, a revised autoencoder based on long short-term memory (LSTM) neural networks is developed, tuned, and evaluated. Finally, it is applied to new Elasticsearch Service cluster data. The final model shows improved convergence compared to the previous model and is able to detect real service issues based on the anomaly scores obtained. Combining these anomaly scores with those obtained from a simple model that predicts the cluster state as a moving average of the preceding states reduces the rate of false positives. The conclusion is that a combined model, which reports anomalies based on the anomaly scores of both the LSTM-based model and the moving average model, is the most sensitive to real service issues.
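
The combination of the two scores can be illustrated with a minimal sketch. The abstract does not state how the scores are combined, so everything here is an assumption made for illustration: the function names, the window size, the thresholds, and the rule that a point is reported only when both scores exceed their thresholds. The LSTM autoencoder itself is not reproduced; its anomaly scores are taken as a given list.

```python
from statistics import fmean


def moving_average_scores(series, window=3):
    """Score each point by its absolute deviation from the mean
    of the (up to) `window` points that precede it."""
    scores = []
    for t, value in enumerate(series):
        history = series[max(0, t - window):t]
        if not history:
            scores.append(0.0)  # first point has no history to compare against
            continue
        scores.append(abs(value - fmean(history)))
    return scores


def combined_flags(lstm_scores, ma_scores, lstm_threshold, ma_threshold):
    """Assumed combination rule: flag a point only if BOTH scores
    exceed their thresholds, which suppresses single-model false alarms."""
    if len(lstm_scores) != len(ma_scores):
        raise ValueError("score lists must have the same length")
    return [
        a > lstm_threshold and b > ma_threshold
        for a, b in zip(lstm_scores, ma_scores)
    ]


# Hypothetical data: one metric with a spike at index 3, and placeholder
# LSTM reconstruction-error scores for the same time steps.
metric = [1.0, 1.0, 1.0, 5.0, 1.0]
lstm_scores = [0.1, 0.2, 0.1, 0.9, 0.8]

ma_scores = moving_average_scores(metric, window=3)
flags = combined_flags(lstm_scores, ma_scores, lstm_threshold=0.5, ma_threshold=2.0)
```

With these made-up values, the LSTM score alone would flag both the spike and the point after it, while the combined rule reports only the spike. This shows one way in which a second, simpler model could reduce false positives.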

Keywords

summer-student programme, CERN openlab
