Implementation-Oblivious Transparent Checkpoint-Restart for MPI
Description: This work presents experience with traditional use cases of checkpointing on a novel platform. A single codebase (MANA) transparently checkpoints production workloads for major, available MPI implementations: "develop once, run everywhere". The new platform allows application developers to compile their application against any of the available standards-compliant MPI implementations, and test each MPI implementation according to performance or other features.
Since its original academic prototype, MANA has been under development for three of the past four years, and is planned to enter full production at NERSC in early Fall of 2023. To the best of the authors' knowledge, MANA is currently the only production-capable, system-level checkpointing package running on a large supercomputer (Perlmutter at NERSC) using a major MPI implementation (HPE Cray MPI). Experiments are presented on several large production workloads, showing low runtime overhead with one codebase supporting four MPI implementations: HPE Cray MPI, MPICH, Open MPI, and ExaMPI.