Achieving Target MTTF by Duplicating Reliability-Critical Components in High Performance Computing Systems

Publication: Article, Conference object
Date: 01 May 2011
Publisher: IEEE
Journal: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

Authors: Nithin Nakka; Alok N. Choudhary; Gary Grider; John Bent; James Nunez; Satsangat Khalsa

doi: 10.1109/ipdps.2011.311

Abstract

Mean Time To Failure (MTTF) is a commonly accepted metric for reliability. In this paper we present a novel approach to achieving a desired MTTF with minimum redundancy. We analyze the failure behavior of large-scale systems using failure logs collected by Los Alamos National Laboratory. We analyze the root causes of failures and present a choice of specific hardware and software components to be made fault-tolerant, through duplication, to achieve a target MTTF at minimum expense. Not all components show similar failure behavior across the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally selected for protection until a target MTTF is reached. We propose a model for the MTTF obtained when failures in a specific component are tolerated system-wide, and order components according to the coverage provided. Systems grouped by hardware configuration showed similar improvements in MTTF when different components in them were targeted for fault tolerance.
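
The incremental selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's model, which the abstract does not specify. It assumes independent, exponentially distributed failures, so that the system failure rate is the sum of per-component rates, and it assumes every duplication costs the same, so that protecting the highest-rate component first is the cheapest next step. The component names, the rates, the `coverage` parameter, and both function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def system_mttf(rates: Dict[str, float], protected: Set[str],
                coverage: float = 1.0) -> float:
    """MTTF (hours) of a series system under the assumed exponential model.

    rates: failures per hour attributed to each component (hypothetical).
    protected: components made fault-tolerant through duplication.
    coverage: assumed fraction of a protected component's failures tolerated.
    """
    residual = sum(
        rate * (1.0 - coverage) if name in protected else rate
        for name, rate in rates.items()
    )
    return float("inf") if residual == 0 else 1.0 / residual


def select_components(rates: Dict[str, float], target_mttf: float,
                      coverage: float = 1.0) -> Tuple[List[str], float]:
    """Protect components in decreasing order of failure rate until the
    target MTTF is met or every component has been protected."""
    protected: Set[str] = set()
    order: List[str] = []
    mttf = system_mttf(rates, protected, coverage)
    for name in sorted(rates, key=rates.get, reverse=True):
        if mttf >= target_mttf:
            break
        protected.add(name)
        order.append(name)
        mttf = system_mttf(rates, protected, coverage)
    return order, mttf


# Hypothetical per-component failure rates (failures per hour), not LANL data.
rates = {"memory": 0.004, "cpu": 0.003, "disk": 0.002, "os": 0.001}
order, mttf = select_components(rates, target_mttf=250.0)
print(order, round(mttf, 1))
```

With these made-up rates the unprotected system has an MTTF of about 100 hours; protecting `memory` alone raises it to about 167 hours, and adding `cpu` raises it to about 333 hours, which meets the 250-hour target. If duplication costs differ between components, ranking by failure rate removed per unit cost would be a natural variation, again as an assumption rather than the authors' method.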

Related Organizations

University of Illinois at Urbana Champaign, United States
University of Illinois System, United States
Northwestern University, United States
Los Alamos National Laboratory, United States

Impact by BIP!

	selected citations	0
	popularity	Average
	influence	Average
	impulse	Average

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
