
handle: 11311/1166030, 10044/1/87950
Progress in the accuracy of numerical weather and climate prediction depends heavily on the growth of available computing power. As the number of cores in top computing facilities pushes into the millions, the increasing average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, and in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between the performance, efficiency and effectiveness of resilience strategies are analysed, and some recommendations are outlined for future developments.
Keywords: Numerical weather prediction; Fault-tolerant computing; Application-level resilience; Iterative solvers; High-performance computing; MPI; Distributed Computing; 0805 Distributed Computing; Science & Technology; Technology; Computer Science; Computer Science, Theory & Methods; Computer Science, Hardware & Architecture; Computer Science, Interdisciplinary Applications; [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; [MATH.MATH-NA] Mathematics [math]/Numerical Analysis [math.NA]; 000; 004
