Publication: Article, Part of book or chapter of book

Date: 17 Dec 2002

Publisher: IEEE Comput. Soc. Press

Journal: Proceedings of IEEE 13th Symposium on Reliable Distributed Systems

Funded by: NSF | NYI: Efficient Fault-Tole...

Authors: Ziv, Avi; Bruck, Jehoshua;

doi: 10.1109/reldis.1994.336909

Analysis of checkpointing schemes for multiprocessor systems

Abstract

Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault tolerance by duplicating a task on more than one processor and comparing the states of the processors at checkpoints. This paper suggests a novel technique, based on a Markov reward model (MRM), for analyzing the performance of checkpointing schemes with task duplication. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results closely match the values we obtained using a simulation program. We compare the average task execution time and total work of four checkpointing schemes, and show that increasing the number of processors generally reduces the average execution time but increases the total work done by the processors. However, when the times needed to perform different operations differ greatly, those results can change.
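To make the idea concrete, here is a minimal sketch of a Markov reward calculation for one simple duplication scheme. It is an illustration under stated assumptions, not the model from the paper: the task is split into equal intervals, two processors run each interval, faults arrive independently at a constant rate per processor, a mismatch at the checkpoint forces both processors to repeat the interval, and every attempt costs the interval length plus a fixed overhead. All function and parameter names (`expected_time_mrm`, `simulate`, `fault_rate`, `overhead`) are hypothetical.

```python
import math
import random


def expected_time_mrm(task_len, n_intervals, fault_rate, overhead):
    """Expected completion time from a small absorbing Markov reward chain.

    State i is the number of verified intervals; state n_intervals absorbs.
    Assumption: independent Poisson faults at `fault_rate` per processor.
    """
    interval = task_len / n_intervals
    # Probability that both replicas finish one interval fault-free.
    q = math.exp(-2.0 * fault_rate * interval)
    reward = interval + overhead  # time charged for every attempt
    v = [0.0] * (n_intervals + 1)
    for i in range(n_intervals - 1, -1, -1):
        # v[i] = reward + q*v[i+1] + (1-q)*v[i]  =>  v[i] = reward/q + v[i+1]
        v[i] = reward / q + v[i + 1]
    return v[0]


def simulate(task_len, n_intervals, fault_rate, overhead, runs=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a cross-check."""
    rng = random.Random(seed)
    interval = task_len / n_intervals
    q = math.exp(-2.0 * fault_rate * interval)
    total = 0.0
    for _ in range(runs):
        done = 0
        while done < n_intervals:
            total += interval + overhead  # each attempt costs the full interval
            if rng.random() < q:          # states match: checkpoint accepted
                done += 1
    return total / runs


if __name__ == "__main__":
    t = expected_time_mrm(100.0, 10, 0.001, 1.0)
    s = simulate(100.0, 10, 0.001, 1.0)
    print("analytic time:", t, "simulated time:", s)
    print("total work on two processors:", 2 * t)
```

In this sketch the total work is the execution time multiplied by the number of processors. Comparing several schemes, as the abstract describes, would require a separate chain for each scheme's recovery rule.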

Related Organizations

Stanford University, United States
Information Systems Laboratories (United States), United States
California Institute of Technology, United States

Keywords

multiprocessing systems, low cost fault-tolerance, parallel computing systems, Markov processes, Markov reward model, multiprocessor systems, checkpointing schemes, hardware redundancy, simulation program, performance evaluation, 004

Impact by BIP!

selected citations: 11. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Funded by

NSF | NYI: Efficient Fault-Tolerant Parallel and Distributed Computing