
In this work, we study tolerance to transient errors, such as those caused by cosmic radiation or by the aging and degradation of hardware components, using Algorithm-Based Fault Tolerance (ABFT). ABFT methods typically add a small amount of computation in the form of invariant checksums which, by definition, should not change as the program executes. By computing and monitoring these checksums, errors can be detected as discrepancies in the checksum values. This is challenging for two key reasons: (1) it requires careful manual analysis of the input program, and (2) the checksum computations must then be carried out efficiently enough that their overhead is justified. Prior work has shown how to apply ABFT schemes with low overhead for a variety of input programs. Here, we focus on stencil applications, an important class of computations found widely across scientific computing domains. We propose a new compilation scheme that automatically analyzes the input program and generates the checksum computations. To the best of our knowledge, this is the first work to do so in a compiler. We show that low-overhead code can be easily generated and provide a preliminary evaluation of the tradeoff between performance and effectiveness.
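To make the checksum idea concrete, the following is a minimal hand-written sketch, not the compilation scheme proposed in this work. It assumes a 3-point linear stencil on a 1D grid with periodic boundaries, for which the sum of the output equals the sum of the weights times the sum of the input. The function names, the weights, and the fault-injection parameter are hypothetical and chosen only for illustration.

```python
def stencil_step(x, weights):
    """One sweep of a 3-point stencil with periodic boundaries (assumed)."""
    a, b, c = weights
    n = len(x)
    return [a * x[(i - 1) % n] + b * x[i] + c * x[(i + 1) % n] for i in range(n)]


def predicted_checksum(x, weights):
    """Checksum the output should have: sum(weights) * sum(x)."""
    return sum(weights) * sum(x)


def checked_step(x, weights, inject_fault_at=None, tol=1e-9):
    """Run one sweep and report whether the checksum invariant was violated."""
    expected = predicted_checksum(x, weights)
    y = stencil_step(x, weights)
    if inject_fault_at is not None:
        y[inject_fault_at] += 1.0  # simulated transient error (illustrative only)
    error_detected = abs(sum(y) - expected) > tol
    return y, error_detected


grid = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
weights = (0.25, 0.5, 0.25)

_, clean_flag = checked_step(grid, weights)
_, faulty_flag = checked_step(grid, weights, inject_fault_at=3)
print(clean_flag, faulty_flag)  # prints "False True"
```

The predicted checksum costs one reduction over the input, while the check costs one reduction over the output. Keeping such reductions cheap relative to the stencil sweep itself is the efficiency concern raised in point (2) above. Non-periodic boundaries would require correction terms that this sketch omits.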
stencils, program transformations, polyhedral compilation, [INFO] Computer Science [cs], fault-tolerance
