Close

Presentation

Lightning Talk: Update on Checkpointing and Localized Recovery for Nested Fork-Join Programs
DescriptionAn important use case for checkpointing is the protection of programs against permanent node failures in clusters. Ideally, checkpoints should be written transparently, i.e., without the need to change the program; and efficiently, i.e., they should include only relevant data. Other frequently desired features are shrinking recovery, i.e., the program should continue to run after failures on the intact resources; and localized recovery, i.e., the recovery should be performed by a few affected processes only.

Asynchronous Many-Task programming (AMT) provides potential to achieve all of the above goals at the same time, since a runtime system can automatically extract relevant data to be checkpointed; and tasks can be re-run after failure. To realize the potential, however, tasks must have clean interfaces, and specialized algorithms are needed for different types of AMT.

One AMT type is nested fork-join programs (NFJ), as exemplified by Cilk. NFJ tasks recursively spawn child tasks. Each task returns a result to its parent and the root task calculates the final result. We disallow side effects in tasks. Many NFJ runtime systems deploy work stealing with a work-first policy.

A task-level checkpointing and recovery scheme for this setting has been presented by the author at the 2021 SuperCheck Workshop [1]. At that time, the algorithm was only outlined, but meanwhile we implemented and experimentally evaluated it [2]. The proposed talk will provide a corresponding update.

Our algorithm uses uncoordinated checkpointing. The checkpoints contain task descriptors, task results, and some status information. They are written to a resilient store. Checkpoints are updated regularly and when tasks are moved due to work stealing, recovery, or result return. Thereby consistency is achieved through specific protocols. Recovery is performed by a buddy and by recent stealing partners. Altogether the algorithm ensures that the program outputs the correct result regardless of the number of failures, even if they occur simultaneously or during recovery. Only when all processes or the resilient store fail, the program aborts with an error message.

We implemented the algorithm by extending a recent cluster implementation of NFJ, which uses one multithreaded process per cluster node, private task pools and lifeline-based victim selection. All threads participate in the work stealing, i.e., any local or remote thread can be the victim of any other. The implementation was done with the "APGAS for Java" programming framework. Moreover, we used a resilient replication-based distributed in-memory store, called IMap, to save our checkpoints. The IMap supports transactions, which are needed by our protocols.

Experiments were run with three benchmarks (Fibonacci, UTS, synthetic benchmark), using up to 1280 threads on 32 homogeneous nodes. We observed an increase of the running time in failure-free runs by up to 28%, and neglectable costs for recovery.

[1] C. Fohry: Checkpointing and Localized Recovery for Nested Fork-Join Programs. Int. Symp. on Checkpointing for Supercomputing, 2021.

[2] L. Reitz, C. Fohry: Task-Level Checkpointing for Nested Fork-Join Programs using Work Stealing. Workshop Proc. Euro-Par, 2023, to appear.
Event Type
Workshop
TimeSunday, 12 November 20234:40pm - 4:50pm MST
Location710
Tags
Fault Handling and Tolerance
Registration Categories
W