Publication: Thesis

Towards efficient error detection in large-scale HPC systems

Gurumdimma, Nentawe
Open Access English
  • Country: United Kingdom
Abstract
The need for computer systems to be reliable has become increasingly important as users' dependence on their accurate functioning grows. The failure of these systems could be very costly in terms of time and money. Although system designers try to design fault-free systems, it is practically impossible to build such systems, as many different factors can affect them. To achieve system reliability, fault tolerance methods are usually deployed; these methods help the system produce acceptable results even in the presence of faults. Root cause analysis, a dependability method in which the causes of failures are diagnosed for the purpose of c...
Subjects
free text keywords: QA76