Presentation

Checkpoint/Restart for CUDA Kernels
Description
In HPC clusters, it has become common to employ Checkpoint/Restart (C/R), that is, saving the execution state of applications in order to restore their computational progress at a later point in time. The benefits of this technique for clusters include more flexibility when reacting to changing workloads and increased fault tolerance. While many clusters already benefit from C/R tools for traditional CPU applications, there is a lack of comparable tools enabling preemptive and transparent C/R for heterogeneous computing, where applications execute partly on accelerator devices such as GPUs. This is despite the increasing use of GPUs as accelerators in HPC clusters. We therefore propose a novel C/R tool that enables saving the execution state of CUDA kernels, thus allowing preemptive C/R of GPU applications. We show that full-featured C/R for NVIDIA GPUs is possible despite the proprietary nature of these devices' hardware and software.
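To illustrate the general C/R idea for GPU code (this is not the tool described above), the following minimal CUDA sketch checkpoints device memory to the host at a kernel boundary and restores it on restart; all names (accumulate, d_data, checkpoint) are illustrative. A preemptive, transparent tool as proposed here would go further and capture the execution state of kernels that are still running, rather than waiting for them to finish.

#include <vector>
#include <cuda_runtime.h>

// Toy kernel: each step adds an increment to every element.
__global__ void accumulate(float *data, int n, float inc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += inc;
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Host-side checkpoint buffer; a real tool would persist this to stable storage.
    std::vector<float> checkpoint(n);

    for (int step = 0; step < 10; ++step) {
        accumulate<<<(n + 255) / 256, 256>>>(d_data, n, 1.0f);

        // Checkpoint at a kernel boundary: drain the device, then copy state out.
        cudaDeviceSynchronize();
        cudaMemcpy(checkpoint.data(), d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    // Restart path: restore the saved device state before resuming computation.
    cudaMemcpy(d_data, checkpoint.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);

    cudaFree(d_data);
    return 0;
}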
Event Type
Workshop
Time
Sunday, 12 November 2023, 3:25pm - 3:50pm MST
Location
710
Tags
Fault Handling and Tolerance
Registration Categories
W