Presentation

When to Checkpoint at the End of a Fixed-Length Reservation?

Session: 13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023)

Description: Consider an application executing for a fixed duration. The checkpoint duration is a random variable that follows a well-known probability distribution. The question is when to take a checkpoint towards the end of the execution so that the expected amount of work done is maximized.

In the first scenario, a checkpoint can be taken at any time. We provide the optimal solution for a variety of probability distributions modeling the checkpoint duration. In the second scenario, the application is a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute, at the beginning of the execution, the optimal number of tasks to run before the checkpoint. Then, we design a dynamic strategy that decides, at the end of each task, whether to checkpoint or to continue execution.
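
To make the first scenario concrete, here is a minimal numerical sketch in Python. It is not the authors' method. It assumes a simplified model that the abstract does not spell out: the reservation has length R, the checkpoint starts x time units before the end, the work R - x is saved only if the checkpoint completes within those x units, and nothing is saved otherwise. Under that assumption the expected saved work is (R - x) * F(x), where F is the CDF of the checkpoint duration. The function names, the uniform distribution, and the grid search are all illustrative choices.

```python
from typing import Callable, Tuple


def uniform_cdf(a: float, b: float) -> Callable[[float], float]:
    """CDF of a checkpoint duration uniformly distributed on [a, b] (illustrative choice)."""
    def cdf(x: float) -> float:
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)
    return cdf


def best_margin(reservation: float,
                cdf: Callable[[float], float],
                steps: int = 10_000) -> Tuple[float, float]:
    """Grid search for the margin x that maximizes (reservation - x) * cdf(x).

    Assumed model (hypothetical, not taken from the abstract): the checkpoint
    starts x time units before the end of the reservation, and the work done
    so far counts only if the checkpoint finishes in time.
    Returns (best margin, expected saved work).
    """
    best_x, best_work = 0.0, 0.0
    for i in range(steps + 1):
        x = reservation * i / steps
        expected_work = (reservation - x) * cdf(x)
        if expected_work > best_work:
            best_x, best_work = x, expected_work
    return best_x, best_work


if __name__ == "__main__":
    margin, work = best_margin(4.0, uniform_cdf(1.0, 3.0))
    print(f"start the checkpoint {margin:.3f} time units before the end; "
          f"expected saved work is {work:.3f}")
```

For a uniform duration on [a, b], the same model can be solved by hand: the product (R - x)(x - a)/(b - a) peaks at x = (R + a)/2 when that value does not exceed b, and at x = b otherwise, which gives a quick check on the grid search.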

Author/Presenters

Quentin Barbut

ENS Lyon