BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240116T191703Z
LOCATION:710
DTSTART;TZID=America/Denver:20231112T161500
DTEND;TZID=America/Denver:20231112T164000
UID:submissions.supercomputing.org_SC23_sess427_ws_scsc104@linklings.com
SUMMARY:Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibil
 ity using Checkpoint History Analytics
DESCRIPTION:Workshop\n\nKevin Assogba (Rochester Institute of Technology),
  Bogdan Nicolae (Argonne National Laboratory (ANL)), Huub Van Dam (Brookha
 ven National Laboratory), and M. Mustafa Rafique (Rochester Institute of T
 echnology)\n\nHigh-performance computing applications are increasingly int
 egrating checkpointing libraries for reproducibility analytics. However, c
 apturing an entire checkpoint history for reproducibility studies faces
  the challenges of high-frequency checkpointing across thousands of
  processes. As a result, the runtime overhead affects application
  performance and intermediate results when interleaving is introduced
  during floating-point calculations. In this paper, we extend
  asynchronous multi-level checkpoint/restart to study the intermediate
  results generated from scientific workflows. We present an initial
  prototype of a framework that captures, caches and compares checkpoint
  histories from different runs of a scientific application executed
  using identical input files. We also study the impact of our proposed
  approach by evaluating the reproducibility of classical molecular
  dynamics simulations executed using the NWChem software. Experimental
  results show that our proposed solution improves the checkpoint write
  bandwidth when capturing checkpoints for reproducibility analysis by a
  minimum of
  30x and up to 211x compared to the default NWChem checkpointing approach.
 \n\nTag: Fault Handling and Tolerance\n\nRegistration Category: Workshop R
 eg Pass\n\nSession Chairs: Gene Cooperman (Northeastern University); Dongl
 ai Dai (X-ScaleSolutions); Rebecca Hartman-Baker (National Energy Research
  Scientific Computing Center (NERSC), Lawrence Berkeley National Laborator
 y (LBNL)); and Bogdan Nicolae (Argonne National Laboratory (ANL))
END:VEVENT
END:VCALENDAR
