Presentation
AI-Augmented SWARM Based Resilience for Integrated Research Infrastructures
Description
The community has spent a dozen years developing production-level checkpointing techniques, such as VeloC, capable of capturing and saving extreme volumes of data with negligible overhead for exascale parallel scientific applications. A novel category of systems will emerge within the next ten years: Integrated Research Infrastructures (IRIs). These infrastructures will connect supercomputers, scientific instrument facilities, large-scale data repositories, and collections of edge devices to form nationwide execution environments that users will share to run scientific workflows. The characteristics of IRIs and the constraints on workflow execution raise a new set of unexplored research questions regarding resilience, especially execution state management. In this talk, we will first review the projected characteristics of IRIs and the user constraints on workflow execution. One projected characteristic of IRIs is the practical difficulty (probable impossibility) of capturing consistent states of the full system: resilience mechanisms will likely need to work with only an approximate view of the system. To address this unique resilience design characteristic, the DOE-funded SWARM project will explore a novel resilience approach based on AI-augmented distributed agents, where each node of the IRI runs an agent whose view of the system is limited to its neighbors. We will review the open research questions raised by this revolutionary approach (fault types, fault detection, fault notification, and execution state capture and management) and some potential directions for addressing them.