
In this paper, we propose a new technique to distinguish the reason for program failure between hardware malfunctions and program bugs, which mitigates the impact of shorter mean time between failures to the debugging process on the future exa-scale supercomputers and improves the productivity of debugging large-scale parallel programs. Our technique detects program failures by observing the abnormal message passing behaviors with distributed monitors and leverages event-driven mechanism to trigger global status checking among different node groups concurrently. Besides, both coarse-grained execution snapshots and fine-grained failure events can be provided for further failure diagnosis and bug analysis. We implement this technique as a user-space library named failure cause resolver (FCR). Experimental results on the Tianhe-2 supercomputer demonstrate that the latency of FCR for failure detection is acceptable with negligible overhead. In addition, FCR does not require administrative privilege and can be easily integrated into existing large-scale parallel programs.
Failure detection, hardware malfunction, Electrical engineering. Electronics. Nuclear engineering, parallel program bug, TK1-9971
Failure detection, hardware malfunction, Electrical engineering. Electronics. Nuclear engineering, parallel program bug, TK1-9971
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 3 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
