BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:710
DTSTART;TZID=America/Denver:20231112T140500
DTEND;TZID=America/Denver:20231112T145000
UID:submissions.supercomputing.org_SC23_sess427_misc308@linklings.com
SUMMARY:AI-Augmented SWARM Based Resilience for Integrated Research Infras
 tructures
DESCRIPTION:Franck Cappello (Argonne National Laboratory (ANL))\n\nThe com
 munity spent a dozen years developing production-level checkpointing techn
 iques, such as VeloC, capable of capturing and saving extreme volumes of d
 ata with negligible overhead for exascale scientific parallel applicat
 ion executions. A novel category of systems will emerge within the next te
 n years: Integrated Research Infrastructures. These infrastructures will c
 onnect supercomputers, scientific instrument facilities, large-scale data 
 repositories, and collections of edge devices to form nationwide execution
  environments that users will share to run scientific workflows. The chara
 cteristics of IRIs and the workflow execution constraints raise a new set 
 of unexplored research questions regarding resilience, especially executio
 n state management. In this talk, we will first review the projected chara
 cteristics of IRIs and the user constraints regarding workflow executions.
  One projected IRI characteristic is the practical difficulty (probable im
 possibility) of capturing consistent states of the full system: resilience
  mechanisms will likely need to work only with an approximate system view.
  To address this unique resilience design characteristic, the DOE-funded S
 WARM project will explore a novel resilience approach based on AI-augmente
 d distributed agents where each node of the IRI runs an agent having a vi
 ew of the system limited to its neighbors. We will review the open researc
 h questions raised by this revolutionary approach (fault types, fault dete
 ction, fault notification, execution state capture, and management) and so
 me potential directions to address them.\n\nTag: Fault Handling and Tolera
 nce\n\nRegistration Category: Workshop Reg Pass\n\nSession Chairs: Gene Co
 operman (Northeastern University); Donglai Dai (Advanced Micro Devices (AM
 D)); Rebecca Hartman-Baker (National Energy Research Scientific Computing 
 Center (NERSC), Lawrence Berkeley National Laboratory (LBNL)); and Bogdan 
 Nicolae (Argonne National Laboratory (ANL), Illinois Institute of Technolo
 gy)\n\n
END:VEVENT
END:VCALENDAR
