Presentation

File Aggregation for Asynchronous Multi-Level Checkpointing
Description
Checkpointing serves many purposes in modern HPC systems and applications. Synchronous checkpointing, which blocks the application until checkpoints are persisted to external storage, has in recent years suffered rising synchronization overheads at scale, leaving the application little forward progress. Asynchronous checkpointing has therefore become more popular: it captures checkpoints quickly to local storage and flushes them in the background while the application continues to run. State-of-the-art solutions like VELOC use a file-per-process strategy, which is difficult for users and parallel file systems to manage. We implement a tunable N-to-M aggregation strategy within VELOC, obtaining 2.5x greater throughput than the state-of-the-art aggregation library ADIOS2 and 1.5x higher throughput than the naive N-to-1 aggregation currently supported by VELOC.
Event Type
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Time
Wednesday, 15 November 2023
4:06pm - 4:15pm MST
Location
505
Registration Categories
TP