Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning

Publication type: Article, Other literature type, Conference object
Date: 15 Apr 2024
Publisher: ACM
Journal: Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems
Funded by: EC | ERATOSTHENES, EC | ENTRUST, EC | INTEND

Authors: Dautov, Rustem; Husom, Erik Johannes

doi: 10.1145/3643915.3644093, 10.5281/zenodo.13384065, 10.5281/zenodo.13384064

Abstract

SEAMS '24: Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, Pages 110-121, https://doi.org/10.1145/3643915.3644093

Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, ensuring fault tolerance and self-recovery in such scenarios remains challenging, because centralised model aggregation acts as a single point of failure. A possible solution to this challenge relies on the continuous replication of the global FL state across participating nodes and on the functional suitability of any node to replace the aggregator in case of failures. These functional requirements can be implemented using one of the existing distributed consensus algorithms, such as Raft. Our approach utilises Raft's leader election and log replication mechanisms to enable automatic stateful recovery after failures and thus to improve fault tolerance. The log replication process efficiently maintains consistency and coherence across distributed FL nodes, ensuring an uninterrupted training process and model convergence. This enhances the robustness of the overall FL system, especially in dynamic and unreliable cyber-physical conditions.

To demonstrate the viability of our approach, we present a proof-of-concept implementation based on the existing FL framework Flower. We conduct a series of experiments to measure the aggregator re-election time and the traffic overheads associated with state replication. Although the traffic overheads grow with the number of FL nodes, as expected, the results demonstrate a resilient self-recovering system capable of withstanding node failures while maintaining model consistency.

Author details: Rustem Dautov (rustem.dautov@sintef.no), Erik Johannes Husom (erik.johannes.husom@sintef.no); SINTEF Digital, Oslo, Norway
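The idea described in the abstract (replicate the global FL state to every node so that any node can take over as aggregator) can be illustrated with a toy, single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it does not use Flower or any Raft library, and the names `Node`, `LogEntry`, `elect_leader`, `fed_avg`, and `replicate_round` are hypothetical. The election is heavily simplified (no randomized timeouts or vote RPCs), and aggregation is a plain unweighted mean.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    term: int             # election term in which the entry was created
    round_no: int         # FL round the weights belong to
    weights: List[float]  # aggregated global model (toy: a flat list)


@dataclass
class Node:
    node_id: int
    alive: bool = True
    term: int = 0
    log: List[LogEntry] = field(default_factory=list)

    def last_log_key(self):
        # Raft treats the log with the higher last term (then the longer
        # log) as more up to date.
        last_term = self.log[-1].term if self.log else 0
        return (last_term, len(self.log))


def elect_leader(nodes: List[Node]) -> Optional[Node]:
    """Pick a new aggregator; requires a majority of the full cluster."""
    live = [n for n in nodes if n.alive]
    if len(live) <= len(nodes) // 2:
        return None  # no quorum, so no leader can be elected
    # Simplification (assumption): the live node with the most up-to-date
    # log wins, ties broken by lowest id. Real Raft uses randomized
    # election timeouts and RequestVote messages.
    leader = max(live, key=lambda n: (n.last_log_key(), -n.node_id))
    new_term = max(n.term for n in live) + 1
    for n in live:
        n.term = new_term
    return leader


def fed_avg(updates: List[List[float]]) -> List[float]:
    """Unweighted element-wise mean of client updates."""
    return [sum(col) / len(col) for col in zip(*updates)]


def replicate_round(leader: Node, nodes: List[Node],
                    round_no: int, updates: List[List[float]]) -> bool:
    """Aggregate on the leader and copy the result to every live node.

    Returns True if a majority holds the entry (i.e. it is committed).
    """
    live = [n for n in nodes if n.alive]
    if not leader.alive or len(live) <= len(nodes) // 2:
        return False
    entry = LogEntry(leader.term, round_no, fed_avg(updates))
    for n in live:
        n.log.append(entry)  # toy replication: a direct in-memory append
    return True


# Usage: the aggregator fails after round 1 and a follower resumes from
# the replicated state instead of restarting training.
cluster = [Node(i) for i in range(5)]
aggregator = elect_leader(cluster)
replicate_round(aggregator, cluster, 1, [[1.0, 2.0], [3.0, 4.0]])
aggregator.alive = False                 # simulated aggregator crash
successor = elect_leader(cluster)
print(successor.node_id, successor.log[-1].weights)
```

Because every live node already holds the latest committed entry, the successor can continue from the last global model. The price is that each round's state is sent to all nodes, which is consistent with the abstract's remark that traffic overheads grow with the number of FL nodes.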

Impact by BIP!

- Selected citations (derived from selected sources): 7
- Popularity (current attention, based on the underlying citation network): Top 10%
- Influence (overall impact, based on the underlying citation network): Top 10%
- Impulse (initial momentum directly after publication): Top 10%

Open access route: hybrid
