A Fault Tolerance Protocol with Fast Fault Recovery

Publication type: Article, Conference object
Date: 01 Jan 2007
Publisher: IEEE
Journal: 2007 IEEE International Parallel and Distributed Processing Symposium

Authors: Sayantan Chakravorty; Laxmikant V. Kalé

doi: 10.1109/ipdps.2007.370310

Abstract

Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all processors to their previous checkpoints after a crash. This wastes a significant amount of computation, since every processor has to redo all of its computation from that checkpoint onwards. In addition, recovery time is bounded by the time between the last checkpoint and the crash. Protocols based on message logging avoid rolling back all processors to an earlier state. However, the recovery time of existing message logging protocols is no smaller than the time between the last checkpoint and the crash. In this paper, we present a fault tolerance protocol that provides fast restarts by combining the ideas of message logging and object-based processor virtualization. We evaluate our implementation of the protocol in the Charm++/Adaptive MPI runtime system. We show that our protocol provides fast restarts and, for many applications, has low fault-free overhead.
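
The contrast the abstract draws between coordinated rollback and message logging can be illustrated with a small toy model. The sketch below is not the authors' protocol and does not use Charm++ or Adaptive MPI. The `Process` class, the `simulate` function, the running-sum "application", and the externally held `log` are all hypothetical names and simplifications introduced only for illustration. It counts how many messages must be reprocessed after a crash under each scheme.

```python
"""Toy comparison of coordinated-checkpoint rollback and message-log replay."""


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = 0        # application state: running sum of payloads
        self.checkpoint = 0   # state saved at the last checkpoint

    def handle(self, payload):
        self.state += payload


def simulate(num_procs, messages, checkpoint_index, crashed_pid):
    """Deliver `messages`, a list of (dest_pid, payload) pairs.

    All processes checkpoint just before message number `checkpoint_index`.
    After the last message, `crashed_pid` fails and is recovered by replay.
    Returns (state_matches, rollback_work, replay_work).
    """
    procs = [Process(p) for p in range(num_procs)]
    # Assumption: the log lives outside the receiver, so it survives a crash.
    log = {p: [] for p in range(num_procs)}

    for i, (dest, payload) in enumerate(messages):
        if i == checkpoint_index:
            for p in procs:
                p.checkpoint = p.state
            for entries in log.values():
                entries.clear()   # older messages are covered by the checkpoint
        log[dest].append(payload)
        procs[dest].handle(payload)

    expected = [p.state for p in procs]

    # Scheme 1: coordinated rollback. Every process returns to its checkpoint,
    # so every message delivered since the checkpoint is processed again.
    rollback_work = sum(len(entries) for entries in log.values())

    # Scheme 2: message logging. Only the crashed process restarts from its
    # checkpoint and replays its own logged messages in the original order.
    crashed = procs[crashed_pid]
    crashed.state = crashed.checkpoint
    for payload in log[crashed_pid]:
        crashed.handle(payload)
    replay_work = len(log[crashed_pid])

    recovered = [p.state for p in procs]
    return expected == recovered, rollback_work, replay_work


messages = [(i % 4, i + 1) for i in range(12)]
ok, rollback_work, replay_work = simulate(4, messages, checkpoint_index=4, crashed_pid=2)
print(ok, rollback_work, replay_work)  # prints: True 8 2
```

In this toy run, rollback reprocesses all 8 messages delivered since the checkpoint, while log replay reprocesses only the 2 addressed to the crashed process. The crashed process still replays its messages serially, which matches the abstract's remark that the recovery time of existing message logging protocols is no smaller than the time since the last checkpoint. The sketch does not model the virtualization-based speedup that the paper proposes.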

Related Organizations

University of Illinois System
United States
University of Illinois at Urbana Champaign
United States

Impact by BIP!

selected citations: 35 (derived from selected sources)
popularity: Average (current attention in the research community, based on the underlying citation network)
influence: Top 10% (overall impact in the research community, based on the underlying citation network)
impulse: Top 10% (initial momentum directly after publication, based on the underlying citation network)

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
