Checkpoint Selection in Fault Recovery Based on Byzantine Fault Model

Xinhai Xu; Yufei Lin


https://doi.org/10.1109/cicn.2...

Article . 2012 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.1109/cic...

Article

Data sources: Microsoft Academic Graph


Publication: Article; 01 Nov 2012; Publisher: IEEE; Journal: 2012 Fourth International Conference on Computational Intelligence and Communication Networks

Authors: Xinhai Xu; Yufei Lin;

doi: 10.1109/cicn.2012.59


Abstract

As supercomputer performance grows, reliability becomes an increasingly serious problem. To complete an application with small fault-recovery overhead, Checkpoint/Restart (C/R) methods are widely used. So far, mainstream C/R methods either assume the Fail-Stop fault model or have the system (or program) perform error detection before storing each checkpoint, so they can guarantee that every checkpoint is correct. However, the faults that occur in real-world systems are better described by the Byzantine fault model, and, in pursuit of higher practical performance, neither the system nor the program implements any fault-detection mechanism. Consequently, checkpoints may contain errors. This paper studies the checkpoint selection problem under the Byzantine fault model: which checkpoint should be chosen as the rollback target after a system failure. We design a checkpoint selection framework and, based on it, propose three checkpoint selection strategies: a conservative strategy, an aggressive strategy, and a statistical strategy. Simulation results show that the conservative strategy is superior when the error latent period is long, while the aggressive strategy behaves in the opposite way. The statistical strategy has stable efficiency, incurring only 50% more overhead than the ideal checkpoint selection when the checkpoint period is half of the mean time between faults.
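The abstract names the three strategies but does not define them, so the following Python sketch is only an illustration of the selection problem and not the authors' framework. It assumes, hypothetically, that the aggressive strategy tries the newest checkpoint first and steps back one at a time, that the conservative strategy jumps straight to the oldest retained checkpoint, and that the statistical strategy starts from the newest checkpoint older than the failure time minus an expected error latency. The cost model (each rollback attempt re-executes from the chosen checkpoint up to the failure time, and a corrupted checkpoint fails again at the same point) and all function names are likewise assumptions.

```python
from typing import List


def aggressive_order(n: int) -> List[int]:
    """Assumed aggressive strategy: newest checkpoint first, then step back."""
    return list(range(n - 1, -1, -1))


def conservative_order(n: int) -> List[int]:
    """Assumed conservative strategy: jump directly to the oldest checkpoint."""
    return [0] if n > 0 else []


def statistical_order(checkpoint_times: List[int], failure_time: int,
                      expected_latency: int) -> List[int]:
    """Assumed statistical strategy: start at the newest checkpoint taken no
    later than (failure_time - expected_latency), then step back."""
    target = failure_time - expected_latency
    start = 0
    for i, t in enumerate(checkpoint_times):
        if t <= target:
            start = i
    return list(range(start, -1, -1))


def rollback_cost(order: List[int], checkpoint_times: List[int],
                  error_time: int, failure_time: int) -> int:
    """Total re-execution time under the toy cost model.

    A checkpoint taken at or before error_time is clean; a later one is
    corrupted and, in this model, leads to the same failure again.
    """
    cost = 0
    for i in order:
        t = checkpoint_times[i]
        cost += failure_time - t      # work redone from this checkpoint
        if t <= error_time:           # clean checkpoint: recovery succeeds
            return cost
    raise ValueError("no clean checkpoint among those tried")


# Hypothetical scenario: checkpoints every 10 time units, failure at t = 45.
times = [0, 10, 20, 30, 40]
failure = 45

for label, error_time in [("short latency", 42), ("long latency", 12)]:
    costs = {
        "aggressive": rollback_cost(aggressive_order(len(times)), times,
                                    error_time, failure),
        "conservative": rollback_cost(conservative_order(len(times)), times,
                                      error_time, failure),
        "statistical": rollback_cost(statistical_order(times, failure, 15),
                                     times, error_time, failure),
    }
    print(label, costs)
```

Under these assumptions the toy model reproduces the qualitative trend stated in the abstract: the aggressive order is cheapest when the error is detected soon after it occurs, and the conservative order is cheapest when the error has been latent across several checkpoints. The numbers are properties of this sketch only and say nothing about the paper's simulation results.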

Related Organizations

National University of Defense Technology
China (People's Republic of)

Impact by BIP!

	selected citations: 2
	popularity: Average
	influence: Average
	impulse: Average


Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

