
Silent Data Corruptions (SDCs) caused by defects in computing chips (CPUs, GPUs, AI accelerators) are a critical threat to the quality of large-scale computing across application domains: cloud computing, high-performance computing, and edge computing. Recent public reports by cloud hyperscalers have emphasized that, apart from the usual suspects for SDCs (memory, storage, network), processing elements of all types, which are the heart of the computation, generate SDCs at an unexpectedly high rate, and these can cause erroneous calculations and severe information loss. We report, in consolidated form, recent efforts to correlate early microarchitecture-level, simulation-based predictions of the likelihood, rates, severity, and root causes of SDCs with large-scale in-field studies in cloud data centers. Early microarchitecture-level prediction of SDC characteristics (susceptible units, workloads, instructions) can shed light on the cryptic problem of SDCs. The findings of a diligent pre-silicon analysis can support a better understanding of SDCs and can thus drive effective protection decisions, at either the hardware or the software level, at the deployment stage.
