Presentation

· Contributors · Organizations · Search Program · My Schedule · Happening Now · Maps

Fault-Tolerance for High-Performance and Big Data Applications: Theory and Practice

DescriptionResilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized across four main topics:

(i) Overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);

(ii) General-purpose techniques, which include several checkpoints and rollback recovery protocols, replication, prediction, and silent error detection;

(iii) Application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation);

(iv) Practical deployment of fault tolerance techniques with User Level Fault Mitigation (MPI standard extension). Relevant examples will include widely used routines such as Monte-Carlo methods, SPMD stencil, map-reduce, and back-propagation in neural networks.

A step-by-step approach will show how to protect these routines and make them fault-tolerant, using a variety of techniques, in a hands-on session.

The tutorial is open to all SC23 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.

Presenters