Students@SC
TP
W
TUT
XO/EX
Description: This workshop will explore the definitions of microaggressions, macroaggressions, and microaffirmations, and effective methods for recognizing their impacts in the workplace. The workshop will consist of understanding and defining biases and reviewing subtle remarks that may seem commonplace but can be harmful. The objective is for participants to gain an understanding of what microaggressions are, how harmful they can be, and how to combat them, including with microaffirmations, to promote a positive culture.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Description: The complexity of node architectures in supercomputers increases as we cross milestones on the way toward exascale and beyond. Increasing levels of parallelism in multi- and many-core chips and emerging heterogeneity of computational resources coupled with energy and memory constraints force a reevaluation of our approaches towards operating systems and runtime environments.
The International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) provides a forum for researchers to exchange ideas and discuss research questions that are relevant to upcoming supercomputers and cloud environments for high-performance computing. In addition to typical workshop publications, we encourage novel and possibly immature ideas, provided that they are interesting and on-topic. Well-argued position papers are also welcome.
Workshop
Education
State of the Practice
W
Description: This paper describes an assignment in the Chapel programming language for creating a 1D heat equation solver. Two methods are used to solve the problem, exposing a variety of parallel programming concepts. The first portion of the assignment uses high-level parallel constructs, namely Chapel's forall loop and Block distribution, to create a simple distributed-memory solver. Here, students are asked to think about what it means for an array to be split across the memory in multiple compute nodes while relying on the language to handle the details of communication and synchronization. The second portion of the assignment uses low-level parallelism, like barriers and explicit communication. Here, the goal is to reduce overhead, while introducing students to the ideas of explicit communication and synchronization. In both parts, students are provided with a non-distributed version of the solver and are asked to create a modified version that runs across multiple compute nodes.
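The assignment itself is written in Chapel; as a rough illustration of the computation being distributed, the short Python/NumPy sketch below shows the explicit finite-difference update at the heart of a 1D heat equation solver (grid size, diffusion factor, and step count are illustrative assumptions, not values from the assignment).

```python
import numpy as np

# Illustrative parameters, not taken from the assignment.
n, steps, alpha = 1_000, 5_000, 0.25   # grid points, time steps, diffusion factor

u = np.zeros(n)
u[0], u[-1] = 1.0, 1.0                 # fixed boundary temperatures

for _ in range(steps):
    # Explicit stencil: each interior point relaxes toward its neighbors.
    # In the Chapel assignment, this is the loop that a `forall` over a
    # Block-distributed array spreads across compute nodes.
    u[1:-1] += alpha * (u[:-2] - 2.0 * u[1:-1] + u[2:])

print(u[:5])
```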
Birds of a Feather
Performance Measurement, Modeling, and Tools
TP
XO/EX
Description: Data-intensive supercomputer applications are increasingly important workloads, especially for "Big Data" problems, but are ill suited for most of today's computing platforms (at any scale!). The Graph500 list has grown to over 357 entries and has demonstrated the challenges of even simple analytics. The new SSSP kernel introduced at SC17 has increased the benchmark's overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and enhance the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
Paper
Exascale
Large Scale Systems
State of the Practice
TP
Description: HPL-MxP is an emerging high-performance benchmark used to measure the mixed-precision computing capability of leading supercomputers. This work presents our effort on the new Sunway supercomputer that linearly scales the benchmark to over 40 million cores, sustains an overall mixed-precision performance exceeding 5 ExaFlop/s, and achieves over 85% of peak performance, the highest efficiency among all heterogeneous systems on the HPL-MxP list. The optimizations in our HPL-MxP implementation include: (1) a Two-Direction Look-Ahead and Overlap algorithm that overlaps all communication with computation; (2) a multi-level process-mapping and communication-scheduling method that uses the network as effectively as possible while maintaining a conflict-free algorithm flow; and (3) a CG-Fusion computing framework that eliminates up to 60% of inter-chip communication and removes the memory-access bottleneck while serving both computation and communication simultaneously. This work can also provide useful insights for tuning cutting-edge applications on Sunway supercomputers as well as other heterogeneous supercomputers.
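The general idea behind HPL-MxP (independent of the Sunway-specific optimizations above) is to perform the O(n^3) factorization in low precision and recover double-precision accuracy with a cheap iterative refinement loop. A minimal NumPy/SciPy sketch of that idea follows, with illustrative sizes and iteration counts; it is not the paper's implementation.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
b = rng.standard_normal(n)

# Expensive O(n^3) step done in single precision.
lu, piv = lu_factor(A.astype(np.float32))

# Initial low-precision solve, then refine the residual in double precision.
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
for _ in range(5):
    r = b - A @ x                                   # FP64 residual
    x += lu_solve((lu, piv), r.astype(np.float32))  # cheap correction solve

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```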
Paper
Accelerators
Applications
Modeling and Simulation
TP
Description: A highly scalable and fully optimized earthquake model is presented, based on the latest Sunway supercomputer. Contributions include:
1) the curvilinear grid finite-difference method (CGFDM) and a flexible model applying a perfectly matched layer (PML), enabling more accurate and realistic terrain descriptions;
2) a hybrid and non-uniform domain decomposition scheme that efficiently maps the model across different levels of the computing system; and
3) sophisticated optimizations that largely alleviate or even eliminate bottlenecks in memory, communication, etc., obtaining a speedup of over 140x.
Combining all innovations, the design fully exploits the hardware potential of all aspects and enables us to perform the largest CGFDM-based earthquake simulation ever reported (69.7 PFlops using over 39 million cores).
Based on our design, the Turkey earthquakes (February 6, 2023) and the Ridgecrest earthquake (July 4, 2019) are successfully simulated with a maximum resolution of 12 m. Precise hazard evaluations for hazard reduction in earthquake-stricken areas are also conducted.
Exhibits
Flash Session
TP
XO/EX
Description: This session will discuss the latest generation of Nokia's PSE (Photonic Switch Engine), which provides up to 1.2 Tb/s per wavelength and helps close the gap to Shannon's limit.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: The complexity and parameter counts of mainstream large models are increasing rapidly. For example, the increasingly popular large language models (e.g., ChatGPT) have billions of parameters. While this has led to performance improvements, for simple tasks the gains may not justify the additional cost. We apply residual networks of three different depths and evaluate them extensively on the MedMNIST pneumonia dataset. Experimental results show that smaller models can achieve satisfactory performance at significantly lower cost than larger models.
Workshop
Artificial Intelligence/Machine Learning
W
Description: The field of optimal control under partial differential equation (PDE) constraints is rapidly changing under the influence of deep learning and the accompanying automatic differentiation libraries. Novel techniques like Physics-Informed Neural Networks (PINNs) and Differentiable Programming (DP) are to be contrasted with established numerical schemes like Direct-Adjoint Looping (DAL). We present a comprehensive comparison of DAL, PINN, and DP using a general-purpose, mesh-free differentiable PDE solver based on radial basis functions. On the Laplace and Navier-Stokes equations, we find DP to be extremely effective, as it produces the most accurate gradients, thriving even when DAL fails and PINNs struggle. Additionally, we provide a detailed benchmark highlighting the limited conditions under which any of these methods can be used efficiently. Our work provides a guide for optimal control practitioners and connects them further to the deep learning community.
Birds of a Feather
Quantum Computing
TP
XO/EX
Description: Integrating quantum computing (QC) test beds into scientific computing environments presents challenges in software interfaces and system familiarity. High-performance computing (HPC) centers are taking on this task, but selecting suitable test bed technologies is complex due to the number of providers with varying maturity levels and the associated risk of single-vendor systems.
A component-based approach is promising but faces challenges from the lack of standardized benchmarks and the need for device-specific calibrations. This discussion addresses the challenges of component-based approaches and explores unifying access to diverse QC technologies, leveraging HPC for optimization, and fulfilling researcher needs.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Description: This paper presents an adaptive continuum synchronization method for data science pipelines deployed on edge-fog-cloud infrastructures. In a diagnostic phase, a model based on the Bernoulli principle is used as an analogy to create a global representation of bottlenecks in a pipeline. In a supervision phase, a watchman/sentinel cooperative system monitors and captures the throughput of the pipeline stages to create a bottleneck-stage scheme. In a rectification phase, this system produces replicas of stages identified as bottlenecks to mitigate workload congestion using implicit parallelism and load-balancing algorithms. This method is invoked automatically and transparently to produce a steady continuum dataflow at runtime. To test our proposal, we conducted a case study on the processing of medical and satellite data on fog-cloud infrastructures. The evaluation revealed that, without characterizing workloads or knowing infrastructure details, this method creates continuum dataflows whose performance is competitive with state-of-the-art solutions.
Exhibitor Forum
Exascale
Programming Frameworks and System Software
Quantum Computing
TP
XO/EX
Description: Take a deep dive into the latest developments in NVIDIA software for high performance computing applications, including a comprehensive look at what’s new in programming models, compilers, libraries, and tools. We'll cover topics of interest to HPC developers, targeting traditional HPC modeling and simulation, quantum computing, HPC+AI, scientific visualization, and high-performance data analytics.
Workshop
Programming Frameworks and System Software
W
Description: Insights about applications and user environments can help HPC center staff make data-driven decisions about cluster operations. In this paper, we present a fast and responsive web-based visualization framework for analyzing HPC application usage. By leveraging XALT, a powerful tool for tracking application and library usage, we collected tens of millions of data points on a national supercomputer. The portable visualization framework, created with Plotly Dash, can be easily launched as a container and accessed from a web browser. The presented visualizations take a deep dive into the XALT data, analyzing application, compiler, and library usage, and even user-specific usage. Our analysis codes can distinguish between centrally installed applications and user-installed applications and can generate plots based on different metrics (number of jobs or CPU-hours). Initial insights gained from this visualization framework have helped our support staff identify several goals for improving the software stack and proactively helping users.
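For readers unfamiliar with Plotly Dash, a self-contained sketch of the kind of containerized, browser-accessible dashboard described here is shown below; the application names and job counts are made up, whereas the actual framework reads aggregated XALT records.

```python
import plotly.express as px
from dash import Dash, dcc, html

# Hypothetical aggregated XALT-style records: application name vs. job count.
usage = {"app": ["gromacs", "vasp", "lammps", "wrf"], "jobs": [1200, 950, 640, 310]}

fig = px.bar(usage, x="app", y="jobs", title="Jobs per application (example data)")

app = Dash(__name__)
app.layout = html.Div([
    html.H2("HPC application usage"),
    dcc.Graph(figure=fig),      # the real framework adds dropdowns and callbacks
])

if __name__ == "__main__":
    app.run(debug=True)         # serves the dashboard in a web browser
```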
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
Description: In this work, we explore how to replicate the behavior of undocumented hardware units -- in this case, NVIDIA's Tensor Cores -- and reason about them.
While prior work has employed manual testing to identify hardware behavior, we show that SMT can be used to generate inputs that can discriminate between different hardware implementation choices. We argue that SMTLIB, the language specification for SMT solvers, is well suited for exposing hardware implementations.
Using our method, we create a formal specification of the tensor cores on NVIDIA's Volta architecture. We confirm many of the findings of previous studies on tensor cores, but also identify two discrepancies: we find that the hardware does not use IEEE-754 round-to-zero for accumulation and that the 5-term accumulator requires 3 extra bits for carry out since it does not normalize intermediate sums.
The work will be presented in person using the poster as a visual aid.
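To give a flavor of the approach (this is a toy, not the poster's actual Tensor Core specification), the Z3 sketch below uses the SMT floating-point theory to ask for inputs on which two candidate accumulator behaviors, round-to-nearest-even versus round-toward-zero, produce different results; such inputs can then be run on the real hardware to discriminate between the candidates.

```python
from z3 import FP, Float32, Solver, fpAdd, fpIsNormal, RNE, RTZ, sat

x, y = FP("x", Float32()), FP("y", Float32())

# Two candidate accumulator behaviors: round-to-nearest-even vs. round-toward-zero.
sum_rne = fpAdd(RNE(), x, y)
sum_rtz = fpAdd(RTZ(), x, y)

s = Solver()
s.add(fpIsNormal(x), fpIsNormal(y))   # keep the witness away from NaN/inf/subnormals
s.add(sum_rne != sum_rtz)             # ask for inputs where the two behaviors differ

if s.check() == sat:
    m = s.model()
    print("discriminating inputs:", m[x], m[y])
```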
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
Description: Modern tasking models define applications in a fine-grained manner that necessitates lower overhead per segment of computation. While previous work has implemented hardware support for tasking models, many implementations lack the support required by heterogeneity and fall short of expanding memory interfaces for data-centric needs and memory utilization. In this paper, we propose and implement a hardware support scheme for the sequential codelet model (SCM). The hardware support makes it possible to demonstrate SCM's potential advantage on heterogeneous workloads and its capability to support the expanding software memory interface. The gem5 implementation of the sequential codelet model serves as a foundation to demonstrate the benefits offered by the SCM program execution model by moving hardware support closer to program semantics. We compare the overhead with DARTS, a software implementation of the codelet model that has been shown to be useful for fine-grained execution, and show a 20x reduction in overhead.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
Description: Automated computational steering automatically guides simulations toward productive states by combining data analysis with predefined control-flow paths. Interactive computational steering achieves a similar goal, but relies on manual human intervention instead. Existing in situ libraries can fulfill some computational steering use cases, but not all of them. This paper presents a general-purpose interface for instrumenting existing simulation codes with interactive computational steering capabilities. Common use cases are presented, summarized from informal interviews with seven research scientists who use large-scale simulations in their work. Preliminary support for bidirectional communication via simulation callbacks and shell commands has been implemented in Ascent, a software library that provides simulations with in situ analysis and visualization infrastructure. Finally, a proof-of-concept instrumentation is provided, demonstrating that the proposed interface is sufficiently flexible to enable any interactive computational steering use case within Ascent-instrumented simulations.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
Description: Detecting strongly connected components (SCCs) is an important step in various graph computations. The fastest GPU and CPU implementations from the literature work well on graphs where most of the vertices belong to a single SCC and the vertex degrees follow a power-law distribution. However, these algorithms can be slow on the mesh graphs used in certain radiative transfer simulations, which have a nearly constant vertex degree and can have significant variability in the number and size of SCCs. We introduce ECL-SCC, an SCC detection algorithm that addresses these shortcomings. Our approach is GPU-friendly and employs innovative techniques such as maximum ID propagation and edge removal. On an A100 GPU, ECL-SCC performs on par with the fastest prior GPU code on power-law graphs and outperforms it by 7.8x on mesh graphs. Moreover, ECL-SCC running on the GPU outperforms fast parallel CPU code by three orders of magnitude on meshes.
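ECL-SCC's maximum-ID-propagation and edge-removal techniques are GPU-specific and not reproduced here; as background, the classic forward-backward decomposition that parallel SCC detectors build on can be sketched in a few lines of serial Python (toy graph, recursion kept for clarity).

```python
from collections import defaultdict

def scc_fw_bw(vertices, edges):
    """Toy forward-backward SCC decomposition (serial, recursive)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for u, v in edges:
        fwd[u].add(v)
        bwd[v].add(u)

    def reach(start, adj, allowed):
        # Vertices reachable from `start` using only vertices in `allowed`.
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in seen or u not in allowed:
                continue
            seen.add(u)
            stack.extend(adj[u])
        return seen

    def recurse(remaining):
        if not remaining:
            return []
        pivot = next(iter(remaining))
        f, b = reach(pivot, fwd, remaining), reach(pivot, bwd, remaining)
        scc = f & b                                   # the pivot's SCC
        rest = [f - scc, b - scc, remaining - f - b]  # three independent subproblems
        return [scc] + [c for part in rest for c in recurse(part)]

    return recurse(set(vertices))

print(scc_fw_bw([1, 2, 3, 4], [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]))
```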
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: The field of in silico cellular modeling has made notable strides in the number of cells that can be simultaneously modeled. While computational capabilities have grown exponentially, I/O performance has lagged behind. To address this issue, we present an in-transit approach to enable in situ visualization and analysis of large-scale fluid-structure-interaction models on leadership-class systems. We describe the proposed framework and demonstrate the feasibility of this approach by measuring the overhead it introduces. The proposed framework provides a valuable tool both for at-scale debugging and for enabling scientific discovery that would be difficult to achieve otherwise.
Posters
Research Posters
TP
XO/EX
Description: In traditional deep learning workflows, AI applications (producers) train DNN models offline using fixed datasets, while inference serving systems (consumers) load the trained models to serve real-time inference queries. In practice, AI applications often operate in a dynamic environment where data is constantly changing. Compared to offline learning, continuous learning frequently (re)trains models to adapt to the ever-changing data. This demands regular deployment of the DNN models, increasing the model update frequency between producers and consumers. Typically, producers and consumers are connected via model repositories such as a parallel file system (PFS), which may result in high model update latency due to the I/O bottleneck of the PFS. To address this, our work introduces a high-performance I/O framework that speeds up model updates between producers and consumers. It employs a cache-aware model handler to minimize latency and an intelligent performance predictor to maintain a balance between training and inference performance.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
Description: Finding a minimum spanning tree (MST) is a fundamental graph algorithm with applications in many fields. This paper presents ECL-MST, a fast MST implementation designed specifically for GPUs. ECL-MST is based on a parallelization approach that unifies Kruskal's and Borůvka's algorithm and incorporates new and existing optimizations from the literature, including implicit path compression and edge-centric operation. On two test systems, it outperforms leading GPU and CPU codes from the literature on all of our 17 input graphs from various domains. On a Titan V GPU, ECL-MST is, on average, 4.6 times faster than the next fastest code, and on an RTX 3080 Ti GPU, it is 4.5 times faster. On both systems, ECL-MST running on the GPU is roughly 30 times faster than the fastest parallel CPU code.
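ECL-MST itself is a GPU code; as background, the Borůvka half of the unified Kruskal/Borůvka formulation is easy to sketch serially in Python. The union-find below uses simple path halving, which is only loosely related to the "implicit path compression" optimization the paper incorporates.

```python
def boruvka_mst(n, edges):
    """Serial Borůvka sketch: edges are (weight, u, v); returns MST edge list."""
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    while len(mst) < n - 1:
        # For every component, remember its cheapest outgoing edge.
        best = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in best or w < best[r][0]:
                        best[r] = (w, u, v)
        if not best:                  # graph is disconnected
            break
        for w, u, v in best.values():
            ru, rv = find(u), find(v)
            if ru != rv:              # components may already be merged this round
                parent[ru] = rv
                mst.append((u, v, w))
    return mst

print(boruvka_mst(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]))
```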
Posters
Research Posters
TP
XO/EX
Description: Numerical simulations require solving linear systems with large sparse matrices that have high condition numbers. LDU factorization with a pivoting strategy provides a robust solver for such systems. The computational complexity of the factorization is high and cannot be reduced within the framework of a direct solver, but by using lower-precision arithmetic, the computational cost and memory usage can be reduced. LDU factorization uses recursive generation of the Schur complement matrix, but generation of the last one can be replaced by an iterative method. Here, decomposition of the whole matrix into a union of moderate and hard parts during factorization with threshold pivoting plays a key role. A new algorithm uses factorization in lower precision as a preconditioner for an iterative solver in higher precision to generate the last Schur complement. True mixed-precision arithmetic is used in the forward/backward substitution for the preconditioner, with the factorized matrix in lower precision and the RHS vectors in higher precision.
Posters
Scientific Visualization & Data Analytics Showcase
Data Analysis, Visualization, and Storage
Modeling and Simulation
Visualization
TP
XO/EX
Description: The Advanced Visualization Lab at NCSA created a cinematic scientific visualization showing a flight through the Milky Way galaxy to the galactic center, where stars orbit a supermassive black hole. The tour summarizes results from Andrea Ghez's Galactic Center Group: their study of the motions of stars around the Milky Way's central black hole reveals a rich and surprising environment, with hot young stars (coded as purple) where few were expected to be, many orbiting in a common plane; a paucity of cooler old stars (yellow); a population of unexpected "G-object" dusty stars (red); and an eclipsing binary star (teal). The black hole itself, shrouded in mystery, is seen only as a tiny, faint, twinkling radio source. But the movement of these nearby stars, especially the S0-2 "hero" (pale blue ellipse), probes the black hole's gravity, exposing its massive presence.
Posters
Research Posters
TP
XO/EX
Description: Identifying genetic mutations is pivotal to enabling clinicians to prescribe personalized therapies to their patients. The Genome Analysis Toolkit's HaplotypeCaller, which relies on the Pair Hidden Markov Model (PairHMM) algorithm, is one of the most widely used applications to identify such variants. However, the PairHMM is the bottleneck of this tool. Deploying the algorithm on hardware accelerators is a valuable solution. Nevertheless, state-of-the-art designs lack the flexibility to support the length variability of the input sequences and are not usable in real-life application scenarios. For these reasons, this work presents a GPU accelerator for the PairHMM capable of supporting sequences of any length, thanks to a dynamic memory-swap methodology, overcoming the limitations of existing solutions. Our accelerator achieves an 8154× speedup over the software baseline, surpassing the best-performing state-of-the-art design by up to 1.6×.
Birds of a Feather
Cloud Computing
Distributed Computing
TP
XO/EX
Description: We are building a National Science Data Fabric (NSDF) that introduces a novel trans-disciplinary approach for integrated data delivery and access to shared storage, networking, computing, and educational resources. Such a data fabric can democratize data-driven scientific discovery across the growing data science community. In this BoF, we want to engage the data science community to discuss the challenges and opportunities of the NSDF project and other similar efforts to connect an open network of institutions, including resource-disadvantaged institutions, and develop a federated testbed configurable for individual and shared scientific use.
Workshop
Algorithms
Applications
Architecture and Networks
W
Description: Vector processors have become essential to high-performance computing in scientific and engineering applications, especially in numerical calculations that leverage data parallelism. With escalating computational demands, the efficient execution of Sparse GEneral Matrix-Matrix Multiplication (SpGEMM) on vector processors has become crucial. However, SpGEMM brings challenges for vector processors due to its complex data structures and irregular memory access patterns.
We present a new method designed to perform SpGEMM on vector processors, inspired by Iterative Row Merging. The proposed method hierarchically merges rows by utilizing long vector instructions. We evaluate the proposed method against other methods across 27 sparse matrices. The results indicate that the proposed method outperforms other methods for 22 out of the 27 sparse matrices, reaching up to 31.9 times better performance in the best case. Furthermore, we compare with the GPU implementation that inspired our proposed method, using the same generation of GPUs.
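The row-merging formulation being vectorized is essentially Gustavson's row-wise SpGEMM: each output row of C = A·B is built by merging the rows of B selected and scaled by the nonzeros of the corresponding row of A. The dict-of-dicts Python sketch below shows that merge in its simplest serial form; the hierarchical merging with long vector instructions proposed in the paper is not captured here.

```python
def spgemm_row_merge(A, B):
    """A and B are sparse matrices as {row: {col: value}}; returns A @ B in the same form."""
    C = {}
    for i, a_row in A.items():
        acc = {}                          # merged (accumulated) output row
        for k, a_ik in a_row.items():     # scale row k of B by A[i,k] and merge it in
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm_row_merge(A, B))   # {0: {1: 16.0}, 1: {0: 15.0}}
```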
Workshop
Algorithms
Applications
Architecture and Networks
W
Description: In dynamic networks, where continuous topological changes are prevalent, it becomes paramount to find and update various graph properties without the computational burden of recalculating them from scratch. However, finding or updating a multi-objective shortest path (MOSP) in such a network is challenging, as it involves simultaneously optimizing multiple (conflicting) objectives.
In light of this, we focus on shortest-path search and propose parallel algorithms tailored specifically for large incremental graphs. We first present an efficient algorithm that updates the single-objective shortest path (SOSP) whenever a new set of edges is introduced. Leveraging this SOSP update algorithm, we also devise a novel heuristic approach to adaptively update a MOSP in large networks. Empirical evaluations of shared-memory implementations on both real and synthetic incremental networks attest to the scalability and efficacy of the proposed algorithms.
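The SOSP-update building block can be illustrated with a small serial sketch: after inserting new edges, only vertices whose tentative distance improves need to be re-relaxed, rather than rerunning Dijkstra from scratch. The parallel formulation and the MOSP heuristic are the paper's contributions and are not shown; the graph below is a toy.

```python
import heapq

def update_sssp(dist, adj, new_edges):
    """Incrementally repair shortest-path distances after inserting new_edges.

    dist: current distance dict; adj: {u: [(v, w), ...]}, extended in place.
    """
    heap = []
    for u, v, w in new_edges:
        adj.setdefault(u, []).append((v, w))
        if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
            dist[v] = dist[u] + w
            heapq.heappush(heap, (dist[v], v))
    # Propagate improvements only from affected vertices.
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {0: [(1, 4)], 1: [(2, 4)]}
dist = {0: 0, 1: 4, 2: 8}
print(update_sssp(dist, adj, [(0, 2, 3)]))   # {0: 0, 1: 4, 2: 3}
```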
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Description: We present a simple performance model to estimate the qubit-count and runtime associated with large-scale error-corrected quantum computations. Our estimates extrapolate current usage costs of quantum computers and show that computing the ground state of the 2D Hubbard model, which is widely believed to be an early candidate for practical quantum advantage, could start at a million dollars. Our model shows a clear cost advantage of up to four orders of magnitude for quantum processors based on superconducting technology compared to ion trap devices. Our analysis shows that usage costs, while substantial, will not necessarily block the road to practical quantum advantage. Furthermore, the combined effects of algorithmic improvements, more efficient error correction codes, and R&D cost amortization are likely to lead to orders of magnitude reductions in cost.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Description: The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches.
To this end, this work evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
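The performance portability figure of 0.96 presumably refers to the widely used harmonic-mean metric of Pennycook et al. (the abstract does not restate it); for an application a, problem p, and platform set H, with efficiency e_i(a,p) achieved on platform i, that metric is

```latex
\mathrm{PP}(a,p,H) =
\begin{cases}
  \dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a,p)}} & \text{if } a \text{ is supported on every platform } i \in H,\\[1.5ex]
  0 & \text{otherwise,}
\end{cases}
```

so a value of 0.96 would mean the harmonic mean of CRK-HACC's per-platform efficiencies across the AMD, Intel, and NVIDIA GPUs is 96%.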
Posters
Research Posters
TP
XO/EX
Description: A software tool called SPEL has been developed to port and optimize the ultrahigh-resolution ELM (uELM) code for GPUs within a functional unit test framework. To promote the widespread adoption of this approach for community-based uELM development, this poster presents a portable software environment that enables efficient development of the uELM code on GPUs. The standalone software environment, which uses Docker, contains all the code, libraries, and system software required for uELM development using SPEL. The process includes identifying a Docker image that supports GPUs, configuring and simulating ELM at the site level, capturing reference solutions, testing uELM functional units, and generating and optimizing GPU-compatible code. The effectiveness of this methodology is demonstrated through a case study.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
Description: Memory disaggregation has recently been adopted in major data centers to improve resource utilization, driven by cost and sustainability. Meanwhile, studies on large-scale HPC facilities have also highlighted memory under-utilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system in three levels, moving from general, to multi-tier memory, and then to memory pooling. We also provide tools to facilitate the quantitative approach. We evaluated a set of representative HPC workloads on an emulated platform. Our results show that interference in memory pooling has varied application impact, depending on access ratio and arithmetic intensity. Finally, our method is applied in two case studies to show benefits at both the application and system level.
Keynote
TP
W
TUT
XO/EX
Description: Dr. Hakeem Oluseyi grew up in some of the roughest neighborhoods in the country. As a result, he spent a lot of time inside, reading encyclopedias and watching PBS nature shows. At a young age, he discovered a love of science and space that was inspired by his role model, Albert Einstein. Throughout his childhood and into young adulthood, he was repeatedly faced with circumstances that would make most people give up—a lack of supervision at home, attending his state’s lowest rated school, falling in with the wrong crowd, and failing physics exams when he ultimately made his way to Stanford. But Hakeem never gave up.
Today, as a world-renowned astrophysicist and the former Space Science Education Lead at NASA, Hakeem inspires audiences around the world to chase impossible dreams, fight for what they want, refuse to listen to naysayers, and reach out and lend a hand up to those around them. Hilarious, honest, and inspiring, Hakeem wows audiences with a look at his mind-bending scientific research while motivating them with his personal life story.
Workshop
Quantum Computing
Software Engineering
W
Description: Practical applications of quantum computing are currently limited by the number of qubits that can be prepared with reasonable fidelities in each system. Therefore, a distributed quantum computing system in which multiple quantum computers are coherently connected is in high demand. To realize inter-node communication of quantum information, a software interface, the Quantum Message Passing Interface (QMPI), was proposed; it leverages the framework built for classical MPI but takes advantage of quantum teleportation to communicate between different quantum nodes. In this project, we develop QMPI with point-to-point and collective operations in Qiskit and characterize its performance through application implementations. Moreover, we develop a new technique for optimizing collective communication in distributed quantum programs with multi-controlled Toffoli gates. This technique beats the state of the art in terms of fidelity and the number of remote EPR pairs consumed, in both simulations and experiments.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Description: HPC systems employ a scheduling technique called "backfilling", wherein low-priority jobs are scheduled earlier to use resources that would otherwise sit idle waiting for pending high-priority jobs. Backfilling relies on job runtimes to calculate the start times of ready-to-schedule jobs and avoid delaying them. It is a common belief that better estimations of job runtime will lead to better backfilling and more effective scheduling. However, our experiments reach a different conclusion: there is an overlooked trade-off between prediction accuracy and backfilling opportunities. To learn how to achieve the best trade-off, we believe reinforcement learning (RL) can be effectively leveraged. Based on this idea, we designed RLBackfilling, a reinforcement-learning-based backfilling algorithm. Our evaluation shows up to 17x better scheduling performance compared to EASY backfilling using user-provided job runtimes and 4.7x better performance compared with EASY using ideally predicted job runtimes (the actual job runtimes).
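For context, the fixed rule that RLBackfilling replaces with a learned policy is the EASY admission test: a queued job may jump ahead only if it fits in the currently idle nodes and its runtime estimate says it will finish before the reservation of the highest-priority pending job. A simplified Python sketch is shown below (the job fields are made up, and the "extra nodes" case of EASY is omitted).

```python
def can_backfill(candidate, free_nodes, head_reservation_time, now):
    """Simplified EASY check: run `candidate` now only if it fits in the currently
    free nodes and, per its runtime estimate, finishes before the highest-priority
    pending job's reserved start time (the 'extra nodes' case is omitted)."""
    fits_now = candidate["nodes"] <= free_nodes
    ends_in_time = now + candidate["estimated_runtime"] <= head_reservation_time
    return fits_now and ends_in_time

job = {"nodes": 8, "estimated_runtime": 3600}            # hypothetical queued job
print(can_backfill(job, free_nodes=16, head_reservation_time=7200, now=1800))  # True
```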
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: High-performance computing (HPC) systems are essential for various scientific fields, and effective job scheduling is crucial for their performance. Traditional backfilling techniques, such as EASY backfilling, rely on user-submitted runtime estimates, which can be inaccurate and lead to suboptimal scheduling. This poster presents RL-Backfiller, a novel reinforcement learning (RL) based approach to improving HPC job scheduling. Our method uses RL to make better backfilling decisions, independent of user-submitted runtime estimates. We trained RL-Backfiller on the synthetic Lublin-256 workload and tested it on the real SDSC-SP2 1998 workload. We show how RL-Backfiller can learn effective backfilling strategies via trial-and-error on existing job traces and outperform traditional EASY backfilling and other heuristic combinations. Our evaluation results show up to 17x better scheduling performance (based on average bounded job slowdown) compared to EASY backfilling.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Description: Distributed scientific applications run on a complex stack of software and network technologies. Each layer has configuration options for tuning performance, ranging from protocol thresholds to algorithmic changes for collectives. Micro-benchmarks are a common methodology for evaluating the communication stack and are relatively easy to tune; however, they are not representative of application behavior. Proxy applications, by contrast, offer a simplified but realistic representation of the main computational and communication methods in scientific programs. Since proxy applications contain realistic message-passing patterns, the correlation between micro-benchmark and proxy-application performance is not obvious. We present a study that statistically analyzes the impacts of tuning. Our results show how tuned micro-benchmark performance correlates with tuned proxy-application performance.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Description: An entire ecosystem of methodologies and tools revolves around scientific workflow management. They cover crucial non-functional requirements that standard workflow models fail to target, such as interactive execution, energy efficiency, performance portability, Big Data management, and intelligent orchestration in the Computing Continuum. Characterizing and monitoring this ecosystem is crucial to develop an informed view of current and future research directions. This work conducts a systematic mapping study of the Italian workflow research community, collecting and analyzing 25 tools and 10 applications from several scientific domains in the context of the "National Research Centre for HPC, Big Data, and Quantum Computing" (ICSC). The study aims to outline the main current research directions and determine how they address the critical needs of modern scientific applications. The findings highlight a variegated research ecosystem of tools, with a prominent interest in advanced workflow orchestration and still immature but promising efforts toward energy efficiency.
Workshop
Data Analysis, Visualization, and Storage
Data Movement and Memory
W
Posters
Research Posters
TP
XO/EX
Description: Triangle counting is a cornerstone operation in large-graph analytics. It has historically been a challenging problem, owing to the irregular and dynamic nature of the algorithm, which not only inhibits compile-time optimizations but also requires runtime optimizations such as message aggregation and load-imbalance mitigation. Popular triangle counting algorithms are either inherently slow, fail to take advantage of the vectorization available in modern processors, or involve sparse matrix operations. With its support for fine-grained asynchronous messages, the Partitioned Global Address Space (PGAS) model combined with the Actor model has been identified as efficient for irregular applications. However, few triangle counting implementations have been implemented optimally on top of PGAS Actor runtimes. To address these challenges, we propose a set-intersection-based implementation of a distributed triangle counting algorithm on top of a PGAS Actor runtime. Evaluation of our approach on the PACE Phoenix cluster and the Perlmutter supercomputer shows encouraging results.
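The set-intersection formulation itself is compact: orient every edge toward its higher-numbered endpoint and intersect the "forward" neighbor sets of the two endpoints, so each triangle is counted exactly once. The serial Python sketch below shows only this kernel; the PGAS/Actor distribution, message aggregation, and load balancing are the poster's contributions.

```python
from itertools import combinations

def count_triangles(edges):
    """Count triangles by intersecting 'forward' neighbor sets (neighbors with a
    larger vertex ID), so each triangle is counted exactly once."""
    fwd = {}
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        fwd.setdefault(a, set()).add(b)
        fwd.setdefault(b, set())
    return sum(len(fwd[a] & fwd[b]) for a in fwd for b in fwd[a])

# A 4-clique on vertices 0..3 contains exactly 4 triangles.
edges = list(combinations(range(4), 2))
print(count_triangles(edges))   # 4
```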
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
Description: GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion.
In this poster, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce), can outperform NCCL as well as Cray MPI by up to 4.5X and 20.2X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
Description: Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute---such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing (HPC), and edge systems, but passing data among computational steps via cloud storage can incur high costs. Here, we overcome this obstacle with a new programming paradigm that decouples control flow from data flow by extending the pass-by-reference model to distributed applications. We describe ProxyStore, a system that implements this paradigm by providing object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
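The pass-by-reference idea can be illustrated with a generic lazy proxy in plain Python. This is a conceptual sketch only, not ProxyStore's actual API: the consumer holds a lightweight reference, and the underlying object is fetched from the store the first time it is used.

```python
class Proxy:
    """Generic lazy proxy: the wrapped object is fetched only on first use."""

    def __init__(self, fetch):
        self._fetch = fetch      # zero-argument callable that retrieves the object
        self._target = None

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself.
        if self._target is None:
            self._target = self._fetch()
        return getattr(self._target, name)

# A stand-in for a remote object store keyed by a reference string.
store = {"result-123": "hello from the data producer"}

ref = Proxy(lambda: store["result-123"])  # cheap to create and pass around
print(ref.upper())                        # the object is resolved here, on first use
```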
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: In recent years, we have seen unprecedented growth of data in our daily lives, ranging from health data from an Apple Watch, financial stock-price data, and volatile cryptocurrency data, to diagnostic data from nuclear/rocket simulations. The increase in high-precision, high-sample-rate time-series data is a challenge for existing database technologies. We have developed a novel technique that utilizes sparse-file support to achieve O(1) time complexity in create, read, update, and delete (CRUD) operations while supporting time granularity down to one second. We designed and implemented XStore to be lightweight and offer high performance without the need to maintain an index of the time-series data. We conducted a detailed evaluation of XStore against existing best-of-breed systems such as MongoDB using synthetic data spanning 20 years at second granularity, totaling over 5 billion data points. Through empirical experiments against MongoDB, XStore achieves 2.5X better latency and delivers up to 3X higher throughput.
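The O(1) CRUD claim rests on computing a record's file offset directly from its timestamp and letting the filesystem's sparse-file support leave unwritten ranges as holes, so no index is needed. The Python sketch below illustrates that addressing scheme with struct-packed float64 samples at one-second granularity; the layout and constants are assumptions for illustration, not XStore's actual on-disk format.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<d")        # one float64 sample per second
EPOCH = 1_600_000_000               # series start time (illustrative)

def offset(ts):
    return (ts - EPOCH) * RECORD.size   # O(1): offset computed from the timestamp

def write_sample(f, ts, value):
    f.seek(offset(ts))                  # seeking past EOF creates a sparse hole
    f.write(RECORD.pack(value))

def read_sample(f, ts):
    f.seek(offset(ts))
    data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else None

path = os.path.join(tempfile.mkdtemp(), "series.dat")
with open(path, "w+b") as f:
    write_sample(f, EPOCH + 86_400 * 365, 42.0)   # a sample one year in: no index needed
    print(read_sample(f, EPOCH + 86_400 * 365))   # 42.0
```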
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Description: HPC not only performs complex calculations at high speed but also processes large amounts of data. HPC systems separate compute nodes and storage nodes to process them effectively: all computation is performed on compute nodes, and all data is stored on storage nodes. To perform data analytics, compute nodes have to read large amounts of data from storage nodes because simulation output is large. Compute nodes must have enough memory to hold extremely large data sets, and bandwidth from storage can become a bottleneck as well. However, the data actually required for analytics is only a small part of the total.

One solution to this problem is computational storage. Since computational storage processes data where it resides and transfers only results to compute nodes, it can reduce data movement and increase performance. SK hynix is researching computational storage technologies with Los Alamos National Laboratory. We propose Object-based Computational Storage (OCS) as a new computational storage platform for data analytics in HPC. OCS offers not only high scalability but also data-aware characteristics, which enable OCS to perform analytics independently, without help from compute nodes. We intend to leverage the Apache analytics ecosystem, including Arrow and Substrait, to enhance that ecosystem with the advantages that computing near storage enables. Systems that use Arrow can transfer query results using a common transfer format, and Substrait provides a standard, open representation of query plans, enabling pushdown of query portions to computational storage.

SK hynix's key technology for OCS is the Object-based Computational Storage Array (OCSA), used as backend storage. With OCSA, OCS will provide flexible query pushdown and analytics acceleration as well as lower software overhead. This talk will introduce the OCS architecture and discuss why we propose OCS as a future direction for computational storage in HPC.
Exhibits
Flash Session
TP
XO/EX
Description: Join us as we delve into real-world customer experiences with data ingestion and discover how a high-performance, highly scalable SMB server accelerates and eases the process. Learn how SMB can help you address complex data ingestion challenges, and how Fusion File Share by Tuxera enhances the efficiency of the process.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Description: Earthquake early warning systems use synthetic data from simulation frameworks like MudPy to train models for predicting the magnitudes of large earthquakes. MudPy, although powerful, has limitations: a lengthy simulation time to generate the required data, a lack of user-friendliness, and no platform for discovering and sharing its data. We introduce the FakeQuakes DAGMan Workflow (FDW), which utilizes the Open Science Grid (OSG) for parallel computations to accelerate and streamline MudPy simulations. FDW significantly reduces runtime and increases throughput compared to a single-machine setup. Using FDW, we also explore partitioned parallel HTCondor DAGMan workflows to enhance OSG efficiency. Additionally, we investigate leveraging cyberinfrastructure, such as the Virtual Data Collaboratory (VDC), to enhance MudPy and OSG. Specifically, we simulate using cloud-bursting policies to enforce FDW job offloading to VDC during OSG peak demand, addressing shared-resource issues and user goals; we also discuss VDC's value in facilitating a platform for broad access to MudPy products.
Workshop
Algorithms
Applications
Architecture and Networks
W
Description: Deep Neural Network guided Monte Carlo Tree Search (DNN-MCTS) is a powerful class of AI algorithms. The DNN operations are highly parallelizable, but tree-search operations are sequential and often become the system bottleneck. Existing parallel MCTS schemes on CPU platforms either exploit data parallelism but sacrifice memory-access latency, or take advantage of local caches for low-latency accesses but constrain the search to a single thread. This work analyzes the trade-offs of these parallel schemes and proposes an adaptive parallel scheme that optimally chooses the parallelization of the MCTS component on the CPU. Additionally, we propose an efficient method for searching for the optimal communication batch size when the CPU interfaces with DNN operations on an accelerator (GPU). Using a DNN-MCTS algorithm on board-game benchmarks, we show that our approach adaptively generates the best-performing parallel implementation, yielding speedups of 1.5-3x over the baseline methods.
Exhibits
Flash Session
TP
XO/EX
Description: In the modern business landscape, AI-driven initiatives are inhibited by an overly complicated data management ecosystem. Organizations are struggling to integrate various databases, distributed object stores, filesystems, and divergent data migration techniques. Learn about DDN's next generation approach to resolve the complexities of diverse infrastructures and unlock AI-driven digital transformation.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
Description: Hyperparameter optimization (HPO) of neural networks is a computationally expensive procedure that has the potential to benefit from novel accelerator capabilities. This paper investigates the performance of three popular HPO algorithms, in terms of the achieved speed-up and model accuracy, utilizing early-stopping, Bayesian, and genetic optimization approaches in combination with mixed-precision functionality on NVIDIA A100 GPUs with Tensor Cores. The benchmarks are performed on 64 GPUs in parallel on three datasets: two from the vision domain and one from the CFD domain. The results show that, depending on the algorithm, larger speed-ups can be achieved with mixed precision than with full-precision HPO if the checkpoint frequency is kept low. In addition to the reduced runtime, small gains in generalization performance on the test set are also observed.
Workshop
Applications
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
Large Scale Systems
Middleware and System Software
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Description: The relatively slow data transfer speeds that cause I/O bottlenecks in scientific simulations are one of the critical challenges in exascale computing. Simulations generate large data volumes, and analysis applications consume this data to provide time-critical insights. The limited capacity and high power consumption of Dynamic Random Access Memory (DRAM) leave slow storage devices as the primary option for large-scale data transfers. Non-volatile memory (NVM) devices such as Intel Optane bridge the gap between storage and volatile memory by providing DRAM-comparable performance and persistence. We present PQueue, a data transfer library for in situ analysis of simulation output using persistent memory. PQueue leverages NVM and provides an API that resembles high-level parallel I/O libraries such as PnetCDF, enabling a seamless transition for application developers. We achieved up to a 7X improvement in write times and up to a 10X improvement in read times compared to PnetCDF.
Workshop
Artificial Intelligence/Machine Learning
W
Description: We leverage physics-embedded differentiable graph network simulators (GNS) to accelerate particulate and fluid simulations to solve forward and inverse problems. GNS represents the domain as a graph with particles as nodes and learned interactions as edges, improving generalization to new environments. GNS achieves over 165x speedup for granular flow prediction compared to parallel CPU simulations. We propose a novel hybrid GNS/Material Point Method to accelerate forward simulations by minimizing error on a surrogate model, achieving 24x speedup. The differentiable GNS enables solving inverse problems through automatic differentiation, identifying material parameters that result in target runout distances. We demonstrate solving inverse problems by iteratively updating the friction angle by computing the gradient of a loss function based on the final and target runouts, thereby identifying the friction angle that matches the observed runout. The physics-embedded and differentiable simulators open an exciting paradigm for AI-accelerated design, control, and optimization.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Description: NVIDIA Grace Hopper Superchips are a scale-up architecture ideal for scientific computing workflows involving CPUs and GPUs. Building on a decade of GPU acceleration, Grace Hopper realizes NVIDIA NVLink C2C, a 900 GB/s interconnect between the Grace CPU and the Hopper H100 GPU. C2C enables coherent memory at 7x the bandwidth of PCIe across Hopper's 96 GB of HBM3 and Grace's up to 480 GB of LPDDR5X. This removes the conceptual CPU/GPU memory divide and lowers barriers for scientists accelerating their applications with ever-faster GPUs, e.g., the H100 delivering up to 67 FP64 teraflops and 4 TB/s of memory bandwidth.

With more application code executing on GPUs, workload performance becomes increasingly susceptible to non-GPU limiters like data movement and CPU performance (Amdahl's Law). C2C and the Grace CPU, ideal for single-thread or multi-core CPU workloads, restore the required balance. Grace combines 72 Arm Neoverse V2 cores with the NVIDIA Scalable Coherency Fabric, a distributed cache and mesh fabric with 3.2 TB/s of bisection bandwidth. This high-bandwidth mesh enables one NUMA node for all 72 CPU cores, simplifying multi-core programming. Each core implements a 512-bit SVE2 SIMD pipeline for a total CPU FP64 theoretical peak of 7.1 teraflops. Combined with the up to 500 GB/s memory bandwidth of the LPDDR5X DRAM, Grace delivers twice the performance-per-watt of conventional x86-64 CPUs.

This session presents HPC and AI workload performance results with a technical deep dive into the specific features of Grace Hopper that accelerate each workload. We discuss how Grace Hopper's distinctive coupling of CPU/GPU hardware and the accompanying software stack create a platform that increases developer productivity, accelerates existing applications, and facilitates new standard programming models in C++, Fortran, and Python. Attendees will gain a deeper understanding of how to extract the performance offered by Grace Hopper and realize the potential of this innovative, energy-efficient platform for science and industry.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionStorage I/O is becoming more of a bottleneck, especially for a new generation of AI-based workloads that are accelerated by GPUs. This session will provide a brief overview of key trends, available solutions presented as lightning talks, and illustrative application performance gains in this space. The majority of the session will engage in an open, forward-looking discussion with the gathered community on promising areas for investigation. Presenters will include those from academia and industry with new and challenging applications, storage partners with characterization expertise, and innovators with new solutions in GPU-initiated storage and greater security. Join us for an exciting exchange!
Workshop
State of the Practice
W
DescriptionThe existing HPC I/O stack struggles with the growing demands of HPC scientific workloads. The latency bottleneck begins with a deeply layered kernel hierarchy that translates HPC I/O requests into actual storage operations; this layered architecture adds significant overhead along the entire I/O request path. Measurements have shown that it takes between 18,000 and 20,000 instructions to send and receive a single fundamental 4KB I/O request. Our novel hardware/software framework, named DeLiBA, aims to bridge this gap by facilitating the development of software components within the HPC I/O stack in user space, rather than in kernel space, and leverages a proven 16 nanometer (nm) FPGA framework to quickly deploy FPGA-based HPC I/O accelerators. Our initial results achieve a 10% increase in throughput and demonstrate up to 2.3 times the I/O operations per second compared to conventional methods.
Exhibits
Flash Session
TP
XO/EX
DescriptionIn the world of high-performance computing, networks serve as vital conduits for data transmission, security, and application delivery. However, the ever-evolving nature of network traffic demands adaptive solutions. This presentation describes the challenges in optimizing and evolving network application performance and the unique role FPGAs can play in accelerating next-generation HPC networks.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionHeterogeneous Intellectual Property (IP) hardware acceleration engines have emerged as a viable path forward to improving performance as Moore’s Law and Dennard scaling wane. In this study, we design, prototype, and evaluate the HPC-specialized ZHW floating-point compression accelerator as a resource on a System on Chip (SoC). Our full hardware/software implementation and evaluation reveal inefficiencies at the system level that significantly throttle the potential speedup of the ZHW accelerator. By optimizing data movement between the CPU, memory, and accelerator, a 6.9X speedup is possible compared to a RISC-V64 core, and 2.9X over a Mac M1 ARM core.
Birds of a Feather
Distributed Computing
State of the Practice
TP
XO/EX
DescriptionThe ACCESS Resource Providers (RPs) will give an overview of the available resources and their unique characteristics. Those resources are open to a broad audience of computational researchers. Individuals can apply for allocations by submitting a request to ACCESS. Once this request is approved, they can exchange their awarded service units for resources at one or several of the providers (e.g., node hours, GPU hours, storage).
The presentations will highlight the variety of resources and will be followed by a discussion with the community, allowing the audience to directly interact with the RPs.
Visit https://app.meet.ps/attendee/fcqctplo to submit questions beforehand.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionThe accurate and efficient determination of hydrologic connectivity has garnered significant attention from both academic and industrial sectors due to its critical implications for environmental management. While recent studies have leveraged the spatial characteristics of hydrologic features, the use of elevation models for identifying drainage paths can be influenced by flow barriers. To address these challenges, our focus in this study is on detecting drainage crossings through the application of advanced convolutional neural networks (CNNs). In pursuit of this goal, we use neural architecture search to automatically explore CNN models for identifying drainage crossings. Our approach not only attains high accuracy (over 97% for average precision) in object detection but also excels in efficiently inferring correct drainage crossings within a remarkably short time frame (0.268 ms). Furthermore, we perform a detailed profiling of our approach on GPU systems to analyze performance bottlenecks.
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionSustainability in HPC is a major challenge not only for HPC centers and their users, but also for society. A lot of effort went into reducing the energy consumption of systems, but most efforts propose solutions targeting CPUs. As HPC systems shift more to GPU-centric architectures, simulation codes increasingly adopt GPU-programming models, leading to an urgent need to increase the energy-efficiency of GPU-enabled codes. However, studies for reducing the energy consumption of large-scale simulations executing on CPUs and GPUs have received insufficient attention.
In this work, we enable accurate energy measurements using an open-source toolkit across CPU+GPU architectures. We use this approach in SPH-EXA, an open-source GPU-centric astrophysical and cosmological simulation framework showing that with code instrumentation, users can accurately measure energy consumption of their application, beyond the data provided by HPC systems. The accurate energy data provide significant insights to users for conducting energy-aware computational experiments and code development.
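As a rough sketch of what code-level energy instrumentation can look like (not the toolkit used in this work), the snippet below wraps a region of interest and reads the Intel RAPL package-energy counter that Linux exposes under /sys/class/powercap; GPU energy would be read analogously, e.g., through NVML:

```python
import time
from contextlib import contextmanager
from pathlib import Path

RAPL = Path("/sys/class/powercap/intel-rapl:0/energy_uj")  # package-0 energy counter

def read_energy_uj():
    return int(RAPL.read_text())

@contextmanager
def energy_region(label):
    """Measure wall time and CPU package energy around an instrumented region."""
    e0, t0 = read_energy_uj(), time.perf_counter()
    yield
    e1, t1 = read_energy_uj(), time.perf_counter()
    joules = (e1 - e0) / 1e6  # counter is in microjoules; wrap-around ignored here
    print(f"{label}: {joules:.2f} J over {t1 - t0:.2f} s ({joules / (t1 - t0):.1f} W avg)")

# Usage (requires read access to the RAPL sysfs file):
# with energy_region("force kernel"):
#     run_timestep()
```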
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionPerformance variability in complex computer systems is a major challenge for accurate benchmarking and performance characterization, especially for tightly-coupled large-scale high-performance computing systems. Point summaries of performance may be both uninformative, if they do not capture the full richness of its behavior, and inaccurate, if they are derived from an inadequate sample set of measurements. Determining the correct sample size requires balancing tradeoffs of computation, methodology, and statistical power.
We treat the performance distribution as the primary target of the performance evaluation, from which all other metrics can be derived. We propose and evaluate a meta-heuristic that dynamically characterizes the performance distribution, determining when enough samples have been collected to approximate the true distribution. Compared to fixed stopping criteria, this adaptive method can be more efficient in resource use and more accurate. Importantly, it requires no advance assumptions about the system under test or its performance characteristics.
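The paper's meta-heuristic is its own; as one plausible sketch of the idea, the snippet below keeps drawing benchmark samples and stops once the empirical distribution appears stable, here measured by the Kolmogorov-Smirnov distance between the first and second halves of the samples, with batch size and threshold as illustrative assumptions:

```python
import numpy as np

def ks_distance(a, b):
    """Maximum gap between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def sample_until_stable(measure, batch=20, eps=0.05, max_samples=2000):
    """Collect performance samples until the distribution stops changing."""
    samples = [measure() for _ in range(2 * batch)]
    while len(samples) < max_samples:
        half = len(samples) // 2
        if ks_distance(np.array(samples[:half]), np.array(samples[half:])) < eps:
            break
        samples.extend(measure() for _ in range(batch))
    return np.array(samples)

# Toy "benchmark": a noisy, occasionally perturbed run time in seconds.
rng = np.random.default_rng(0)
measure = lambda: rng.normal(1.0, 0.05) + (rng.random() < 0.1) * rng.exponential(0.2)
runs = sample_until_stable(measure)
print(f"stopped after {len(runs)} runs, median {np.median(runs):.3f} s")
```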
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
DescriptionGlobal ocean data assimilation is a crucial technique to estimate the actual oceanic state by combining numerical model outcomes and observation data, which is widely used in climate research. Due to the imbalanced distribution of observation data in the global ocean, the parallel efficiency of recent methods suffers from workload imbalance. When massive GPUs are applied for global ocean data assimilation, the workload imbalance becomes more severe, resulting in poor scalability. In this work, we propose a novel adaptive workload-balance scheduling strategy for assimilation, which successfully estimates the total workload prior to execution and ensures a balanced workload assignment. Further, we design a parallel dynamic programming approach to accelerate the schedule decision, and develop a factored dataflow to exploit the parallel potential of GPUs. Evaluation demonstrates that our algorithm outperforms the state-of-the-art method by up to 9.1x speedup. This work is the first to scale global ocean data assimilation to 4,000 GPUs.
Workshop
Applications
Software Engineering
W
DescriptionData race detection tools should find data races not only in development builds of applications, but also in optimized production builds. One architecture-dependent optimization is vectorization of the code. At the moment, DataRaceBench does not contain microkernels that test for data races in vectorized code. The few codes with SIMD directives are too simple, so compilers tend to refuse to vectorize the loops. We carefully created new microkernels, with and without data races, that a tool will only detect if vector instructions are considered in the analysis. The new microkernels cover different vectorized memory access instructions. We used the new microkernels to verify the support for vectorized memory accesses in Intel Inspector and LLVM ThreadSanitizer.
While Intel Inspector could detect all data races in the new microkernels, ThreadSanitizer could not find the data races when the code is vectorized.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionThis lightning talk will highlight how several aspects of sustainability can frame the programming themes for a senior-level parallel computing class. We used the shallow water equation as a theme in our assignments from serial C & MPI, through OpenMP/PThreads, to CUDA. By framing the problem sets in the setting of sustainability, both in terms of power usage/performance and in terms of motivating the problem we are solving (the shallow water equation) through its sustainability/environmental impact, our goal is to help the students get genuinely excited about our field and “spread the word”.
The inspiration for the core idea of this work came from attending the CDER 2022 PDC training workshop. It led to an ongoing related miniproject sponsored by Norway's national Excited Centre of Excellent IT Education.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms and Frameworks
W
DescriptionAdvancements in reinforcement learning (RL) via deep neural networks have enabled their application to a variety of real-world problems. However, these applications often suffer from long training times. While attempts to distribute training have been successful in controlled scenarios, they face challenges in heterogeneous-capacity, unstable, and privacy critical environments. This work applies concepts from federated learning (FL) to distributed RL, specifically addressing the stale gradient problem. A deterministic framework for asynchronous federated RL is utilized to explore dynamic methods for handling stale gradient updates in the Arcade Learning Environment. Experimental results from applying these methods to two Atari-2600 games demonstrate a relative speedup of up to 95% compared to plain A3C in large and unstable federations.
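The specific staleness-handling methods evaluated in the paper are its own; as a generic sketch of the underlying idea, a parameter server can down-weight gradient updates by how many global steps old they are before applying them, a common way of taming stale gradients in asynchronous federations (the decay schedule below is an assumption):

```python
import numpy as np

class AsyncServer:
    """Global parameter server that discounts stale worker gradients."""

    def __init__(self, dim, lr=0.1):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.version = 0  # incremented on every applied update

    def apply(self, grad, worker_version):
        staleness = self.version - worker_version
        weight = 1.0 / (1.0 + staleness)     # older gradients count for less
        self.theta -= self.lr * weight * grad
        self.version += 1
        return self.version

# Toy usage: two workers pushing gradients computed against different versions.
server = AsyncServer(dim=4)
server.apply(np.ones(4), worker_version=0)   # fresh update, full weight
server.apply(np.ones(4), worker_version=0)   # one step stale, weight 1/2
print(server.theta, "version", server.version)
```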
Tutorial
Data Analysis, Visualization, and Storage
I/O and File Systems
Large Scale Systems
Performance Measurement, Modeling, and Tools
TUT
DescriptionAs concurrency and complexity continue to increase on high-end machines, storage I/O performance is rapidly becoming a fundamental challenge to scientific discovery. At the exascale, online analysis will become a dominant form of data analytics, and thus scalable in situ workflows will become critical, along with high performance I/O to storage. The many components of a workflow running simultaneously pose another challenge of evaluating and improving the performance of these workflows. Therefore, performance data collection needs to be an integral part of the entire workflow.
In this tutorial, we present ADIOS-2, which allows for building in situ and file-based data processing workflows for extreme-scale systems, including interactive, on-demand, in situ visualization of the data, and performance profiling of the entire workflow. Half of this tutorial will be hands-on sessions, where we provide access to the software and together build a complete MiniApp with in situ analytics and performance analysis that users can run on their laptops and at large scale on supercomputers. We will show how ADIOS-2 is fully integrated into three popular visualization and performance tools: Jupyter Notebook, ParaView, and TAU, creating a software ecosystem for in situ processing of both performance and scientific data.
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionSZ is a lossy floating-point data compressor that excels in compression ratio and throughput for high-performance computing (HPC), time series databases, and deep learning applications. However, SZ performs poorly for small chunks and has slow decompression. We pinpoint the Huffman tree in the quantization factor encoder as the bottleneck of SZ. In this paper, we propose ADT-FSE, a new quantization factor encoder for SZ. Based on the Gaussian distribution of quantization factors, we design an adaptive data transcoding (ADT) scheme to map quantization factors to codes for better compressibility, and then use finite state entropy (FSE) to compress the codes. Experiments show that ADT-FSE improves the quantization factor compression ratio, compression and decompression throughput by up to 5x, 2x and 8x, respectively, over the original SZ Huffman encoder. On average, SZ_ADT is over 2x faster than ZFP in decompression.
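ADT-FSE's actual transcoding tables and FSE backend are not reproduced here; the sketch below only illustrates the first half of the idea under stated assumptions: quantization factors clustered around a central value are remapped with a zig-zag transform so that frequent symbols receive small codes, after which a generic entropy coder (zlib standing in for FSE) compresses the code stream:

```python
import zlib
import numpy as np

def zigzag(q, center):
    """Map factors near `center` to small non-negative codes: 0, 1, 2, 3, ..."""
    d = q.astype(np.int64) - center
    return np.where(d >= 0, 2 * d, -2 * d - 1).astype(np.uint16)

rng = np.random.default_rng(1)
# Synthetic quantization factors with a Gaussian-like spread around 512.
quant = np.clip(rng.normal(512, 3, 1_000_000).round(), 0, 1023).astype(np.uint16)

center = int(np.bincount(quant).argmax())            # adapt the mapping to the data
packed = zlib.compress(zigzag(quant, center).tobytes(), 9)  # zlib stands in for FSE
print(f"{quant.nbytes} raw bytes -> {len(packed)} bytes "
      f"(ratio {quant.nbytes / len(packed):.1f}x)")
```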
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionTestbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers for the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements, while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.
Tutorial
Algorithms
Message Passing
Performance Optimization
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
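The tutorial's own examples are written against C/MPI; as a rough Python sketch of one listed topic, the snippet below performs a 2D stencil halo exchange over a Cartesian process topology with mpi4py (local sizes and the periodic layout are illustrative assumptions):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)            # factor ranks into a 2D grid
cart = comm.Create_cart(dims, periods=[True, True])    # periodic Cartesian topology

n = 8
u = np.zeros((n + 2, n + 2))                           # local interior plus one-cell halo
u[1:-1, 1:-1] = cart.Get_rank()

# Row halos are contiguous: Shift returns (source, destination) along a dimension.
src, dst = cart.Shift(0, 1)
cart.Sendrecv(np.ascontiguousarray(u[-2, 1:-1]), dest=dst, recvbuf=u[0, 1:-1], source=src)
cart.Sendrecv(np.ascontiguousarray(u[1, 1:-1]), dest=src, recvbuf=u[-1, 1:-1], source=dst)

# Column halos are strided, so stage them through contiguous scratch buffers
# (a C version would typically use an MPI vector datatype here instead).
src, dst = cart.Shift(1, 1)
recv = np.empty(n)
cart.Sendrecv(np.ascontiguousarray(u[1:-1, -2]), dest=dst, recvbuf=recv, source=src)
u[1:-1, 0] = recv
cart.Sendrecv(np.ascontiguousarray(u[1:-1, 1]), dest=src, recvbuf=recv, source=dst)
u[1:-1, -1] = recv

# One Jacobi-style update using the freshly exchanged halos.
u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
```

Run it with, e.g., mpirun -n 4 python halo2d.py (the file name is arbitrary).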
Tutorial
Accelerators
Heterogeneous Computing
Performance Optimization
TUT
DescriptionWith the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. All topics are accompanied by extensive case studies, and we discuss the corresponding language features in-depth. Continuing the emphasis of this successful tutorial series, we focus solely on performance programming for multi-core architectures. Throughout all topics, we present the recent additions of OpenMP 5.2 and comment on developments targeting OpenMP 6.0.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionFPGAs have gone from niche components to being a central part of many data centers worldwide. The last year has seen tremendous advances in FPGA programmability and technology, especially in the shift to reconfigurable architectures that are heterogeneous and/or based on CGRAs or other AI engines. This BoF has two parts. The first is a series of lightning talks presenting advances in tools, technologies, and use-cases for these emerging architectures. The second part of the BoF will be a general discussion driven by the interests of the attendees, potentially including additional topics.
Workshop
Data Analysis, Visualization, and Storage
Data Movement and Memory
W
DescriptionReal-world HPC workloads, including simulations and machine learning, place significant strain on storage infrastructure due to their data dependency, exacerbated by the diverse storage options in modern HPC environments, leading to I/O bottlenecks. To mitigate these bottlenecks, past analysis methods relied on manual evaluations and tools like Darshan for I/O trace collection, often necessitating expert involvement and substantial time commitments. Given the time-intensive nature of manual analysis and the pressing need to mitigate I/O bottlenecks effectively, automated analysis tools were developed. According to our findings, these tools, while providing automation, can still benefit from transitioning from heuristics-based approaches to more data-driven decision-making. To address this, we propose a data-driven approach that leverages multi-perspective views.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionAs supercomputers become larger, with powerful Graphics Processing Units (GPUs), traditional direct eigensolvers struggle to keep up with the hardware evolution and to scale efficiently due to communication and synchronization demands. Subspace eigensolvers, like the Chebyshev Accelerated Subspace Eigensolver (ChASE), have a simpler structure and can overcome communication and synchronization bottlenecks. ChASE is a modern subspace eigensolver that uses Chebyshev polynomials to accelerate the computation of extremal eigenpairs of dense Hermitian eigenproblems. In this work we show how we have modified ChASE by rethinking its memory layout, introducing a novel parallelization scheme, switching to a more performant communication-avoiding algorithm for one of its inner modules, and substituting the MPI library with the vendor-optimized NCCL library. The resulting library can tackle dense problems with size up to N=O(10^6), and scales effortlessly up to the full 900 nodes---each one powered by 4xA100 NVIDIA GPUs---of the JUWELS Booster hosted at the Jülich Supercomputing Centre.
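ChASE itself is a distributed, GPU-resident library; the numpy sketch below only illustrates the numerical idea it builds on, under simplifying assumptions (a synthetic Hermitian matrix with a known spectral gap, crude filter bounds, and a fixed polynomial degree): a Chebyshev filter amplifies the lowest eigenpairs relative to the rest of the spectrum, and a Rayleigh-Ritz step extracts them from the filtered block.

```python
import numpy as np

def cheb_filter(H, X, deg, a, b):
    """Degree-`deg` Chebyshev filter that damps eigenvalues inside [a, b]
    and amplifies those below a (the wanted, extremal part of the spectrum)."""
    c, e = (a + b) / 2.0, (b - a) / 2.0
    Y_prev, Y = X, (H @ X - c * X) / e
    for _ in range(2, deg + 1):
        Y_prev, Y = Y, 2.0 / e * (H @ Y - c * Y) - Y_prev
    return Y

rng = np.random.default_rng(0)
n, k = 300, 6
vals = np.concatenate([np.linspace(0.0, 1.0, k), np.linspace(5.0, 100.0, n - k)])
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
H = V @ np.diag(vals) @ V.T                          # Hermitian test matrix, known spectrum

a, b = 5.0, 100.0                                    # interval holding the unwanted spectrum
X = np.linalg.qr(rng.standard_normal((n, k)))[0]     # random starting block

for _ in range(3):                                   # filter + Rayleigh-Ritz sweeps
    Q = np.linalg.qr(cheb_filter(H, X, deg=15, a=a, b=b))[0]
    theta, S = np.linalg.eigh(Q.T @ H @ Q)           # Rayleigh-Ritz on the filtered block
    X = Q @ S

print("Ritz values :", np.round(theta, 6))
print("exact lowest:", np.round(np.sort(vals)[:k], 6))
```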
Birds of a Feather
Applications
TP
XO/EX
DescriptionAgriculture worldwide is facing massive challenges in production, distribution, pollution reduction, and food security and waste: less than 40% of any crop is actually marketed. The farm, the oldest human-engineered system, produces the vast majority of human sustenance and consumes the majority of global freshwater. Its efficient operation is of vital importance, particularly when supply chains are disrupted by wars and pandemics. This BoF will discuss how novel supercomputing technologies and related distributed heterogeneous systems at scale could empower the primary sector so that it no longer operates in a needlessly fragile and inefficient way.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionGenerative AI is quickly becoming mainstream and everyone wants a slice of the AI pie. Led by an AI and HPC expert from Penguin Solutions, this exhibitor forum will explore what it takes to deliver AI to the masses – from cost to management of running AI architectures. The speaker will discuss options available for companies to scale their AI infrastructure, including renting AI factories in the cloud with a pay-as-you-go model versus building an AI factory of your own.
Two of the most important questions without a doubt have to be around cost and management of running AI architectures. Audience members will come away from this forum with real-world insights that they can apply directly to whatever their current AI setup is. This technical deep dive will also go in-depth on the tools you can implement, like Penguin Computing TrueHPC that can be used with AI solutions to easily build complex, high-performance environments across the many facets of your IT infrastructure.
Want to learn about the pros and cons of building an AI factory in the cloud and using a pay-as-you-go model? Or are you more interested in buying or building your very own AI factory? How do cost and performance factor into all of this? This forum will answer all of those questions and more, and leave audience members with actionable takeaways that have the power to positively impact current AI operations. We’re throwing away the notion that you have to be an established enterprise with deep pockets to run AI models and empower the supercomputing community.
Workshop
Fault Handling and Tolerance
W
DescriptionThe community spent a dozen years developing production-level checkpointing techniques, such as VeloC, capable of capturing and saving extreme volumes of data with negligible overhead for exascale scientific parallel application executions. A novel category of systems will emerge within the next ten years: Integrated Research Infrastructures (IRIs). These infrastructures will connect supercomputers, scientific instrument facilities, large-scale data repositories, and collections of edge devices to form nationwide execution environments that users will share to run scientific workflows. The characteristics of IRIs and the workflow execution constraints raise a new set of unexplored research questions regarding resilience, especially execution state management. In this talk, we will first review the projected characteristics of IRIs and the user constraints regarding workflow executions. One projected IRI characteristic is the practical difficulty (probable impossibility) of capturing consistent states of the full system: resilience mechanisms will likely need to work only with an approximate system view. To address this unique resilience design characteristic, the DOE-funded SWARM project will explore a novel resilience approach based on AI-augmented distributed agents, where each node of the IRI runs an agent with a view of the system limited to its neighbors. We will review the open research questions raised by this revolutionary approach (fault types, fault detection, fault notification, execution state capture, and management) and some potential directions to address them.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionRecent advances in artificial intelligence methods show the enormous potential of AI methods. The underlying concept is the use of embedding spaces to represent real-world information. These embedding spaces have been used to represent, transform, and work with complex information in large-language models but also in many other domains such as climate sciences or automated driving systems. In this talk, we focus on embedding spaces for programs and use them primarily to assess, analyze, and improve program performance. We start by deriving a first embedding from the textual LLVM intermediate representation (IR) and show that it successfully predicts GPU execution times of programs. We then show that textual representations bear the danger of missing context and of being overly sensitive to specific strings. Using a graph-based representation, we improve the embedding to capture relationships such as data dependencies and flows in LLVM IR. Finally, we discuss DaCe's performance metaprogramming capabilities and its programmable graph-based IR. We then demonstrate how a graph-neural-network (GNN)-based embedding can capture general performance properties. Those properties form the concept of Performance Embeddings for Transfer Tuning and can be used to select optimization metaprograms to apply to transform the IR graph.
Workshop
Applications
State of the Practice
W
DescriptionCancer is complex, with contributing factors distributed across the entire genome affecting every aspect of the disease. But typical artificial intelligence and machine learning (AI/ML) would require 3B-patient training sets to generate predictive models from the whole 3B-nucleotide genome. As a result, tests remain limited to one to a few hundred genes. Prediction continues to rely mostly on such factors as a tumor’s grade and the patient’s age. And the understanding and management of cancer continue to involve guesswork.
A genome-wide pattern in glioblastoma brain cancer tumors was experimentally validated in a retrospective clinical trial as the most accurate and precise predictor of life expectancy and response to standard of care [1]. Applicable to the general population, this predictor, the first to encompass the whole genome, and predictors in lung, nerve, ovarian, and uterine cancers, were mathematically (re)discovered and computationally (re)validated in open-source datasets from as few as 50–100 patients by using our AI/ML [2,3]. Data-agnostic, our algorithms, multi-tensor comparative spectral decompositions, extend the mathematics that underlies quantum mechanics to overcome typical AI/ML obstacles by not requiring large amounts of data, balanced data, or feature engineering. All other attempts to connect a glioblastoma patient’s outcome with the tumor’s DNA copy numbers failed. For 70 years, the best indicator has been age. At 75–95% accuracy, our predictor is more accurate than and independent of age and all other indicators. Platform- and reference genome-agnostic, the predictor’s >99% precision is greater than the community consensus of <70% reproducibility based upon one to a few hundred genes. It describes mechanisms for transformation, and identifies drug targets and combinations of targets to sensitize tumors to treatment.
Now, in follow-up results from the trial we, first, show correct prospective prediction of the outcome of the five of the 79 patients who were alive four years earlier, at the time of first results. Two patients, who were predicted to have shorter survival, lived less than five years from diagnosis, whereas of the three patients predicted to have longer survival, one lived more than five, and the remaining two are alive >11.5, years from diagnosis. Second, we demonstrate 100%-precise clinical prediction for the 59 of the 79 patients with remaining tumor DNA by using whole-genome sequencing in a regulated laboratory. Third, we establish that the risk that a tumor’s whole genome confers upon outcome, as is reflected by the predictor, is surpassed only by the patient’s access to radiotherapy.
This is a proof of principle that our AI/ML is uniquely suited for personalized medicine. This also demonstrates that the inclusion of complete genomes, and the normal diversity within, is, beyond fair AI/ML, a scientific, engineering, and medical necessity, because a patient’s survival and response to treatment are the outcome of their tumor’s whole genome. We conclude that our AI/ML-derived whole-genome predictors can take the guesswork out of cancer.
[1] Ponnapalli et al., APL Bioeng 4, 026106 (2020); https://doi.org/10.1063/1.5142559
[2] Bradley et al., APL Bioeng 3, 036104 (2019); https://doi.org/10.1063/1.5099268
[3] Alter et al., PNAS 100, 3351 (2003); https://doi.org/10.1073/pnas.0530258100
Workshop
Architecture and Networks
W
DescriptionIn this work, we introduce Altis-SYCL, a benchmark suite based on SYCL for GPUs and FPGAs. For developing Altis-SYCL, we leverage the oneAPI heterogeneous programming framework in two consecutive steps: 1) by using the modern Altis GPGPU benchmark suite as a baseline and migrating it from CUDA to SYCL, and 2) by exploring several techniques to optimize the performance of the resulting SYCL code. Our migration-and-optimization methodology starts by targeting GPUs and progressively moves towards FPGAs. In this process, we discuss the differences between device-specific strategies as well as detailing the required code refactoring and optimization efforts. The performance of Altis-SYCL was evaluated on Stratix 10 and Agilex FPGAs, and for some applications, their execution runtimes were competitive with those achieved on the latest high-end GPUs. The corresponding code is released as open source at https://github.com/esa-tu-darmstadt/altis_sycl.
Birds of a Feather
HPC in Society
TP
XO/EX
DescriptionThe SC23 edition of the Birds of a Feather “Americas High-Performance Computing Collaboration: Global Actions” seeks to showcase collaborations that have resulted from the partnerships formed since the first edition at SC19, presenting opportunities and experiences between different HPC networks and laboratories from countries in North, Central, and South America and other continents, mainly Europe. In the BoF, different aspects will be discussed around the expectations and experiences of collaboration in HPC, to feed the continental roadmap. This BoF is a crucial step to support the signature of an MoU to start the formalization of the Americas HPC Collaboration.
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionAs supercomputers advance toward exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to address these two challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how AMR and error-bounded lossy compression can function together. To this end, this study presents a novel in-situ lossy compression framework that employs the HDF5 filter to improve both I/O costs and boost compression quality for AMR applications. We implement our solution into the AMReX framework and evaluate on two real-world AMR applications, Nyx and WarpX, on the Summit supercomputer. Experiments with 512 cores demonstrate that AMRIC improves the compression ratio by 81x and the I/O performance by 39x over AMReX's original compression solution.
Workshop
State of the Practice
W
DescriptionAs high-performance computing approaches the exascale era, the analysis of the vast amount of monitoring data generated by supercomputers has become increasingly challenging for data analysts. The detection of change points, which plays a critical role in anomaly detection, performance optimization, and root cause analysis of problems and failures, has grown beyond human capacity for manual review. To address this issue, our focus lies in developing an effective model capable of identifying anomalous behavior, and to achieve this, we introduce the concept of an online adaptive sampling algorithm. Evaluating the model's performance across various use cases, we conduct tests on complex datasets to detect change points. Overall, we observe that the model successfully captures key features of normal behavior, and we believe it opens promising avenues for further research, particularly in assisting with various tasks related to anomaly detection and performance optimization in high-performance computing environments.
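The model developed in this work is its own; as a sketch of the kind of change-point logic it automates, the snippet below runs a classic one-sided CUSUM detector over a synthetic monitoring signal, with the baseline statistics, drift, and threshold chosen purely for illustration:

```python
import numpy as np

def first_change(stream, mean, std, drift=0.5, threshold=8.0):
    """Classic two-sided CUSUM: return the index of the first detected shift
    (up or down) in a stream standardized against its normal behavior."""
    g_pos = g_neg = 0.0
    for i, x in enumerate(stream):
        z = (x - mean) / std
        g_pos = max(0.0, g_pos + z - drift)
        g_neg = max(0.0, g_neg - z - drift)
        if g_pos > threshold or g_neg > threshold:
            return i
    return None

# Synthetic node metric: stable around 50, then shifting to 58 at sample 300.
rng = np.random.default_rng(3)
signal = np.concatenate([rng.normal(50, 2, 300), rng.normal(58, 2, 200)])
print("first change point flagged at sample", first_change(signal, mean=50.0, std=2.0))
```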
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms and Frameworks
W
DescriptionGraph Neural Networks (GNNs) are becoming increasingly popular for applying neural networks to graph data. However, as the size of the input graph increases, the GPU memory wall problem becomes an important issue. Since both current solutions to reduce the memory footprint, such as mini-batch approaches and the use of memory-efficient tensor manipulations, have drawbacks, we attempt to solve the problem by expanding the memory size using a virtual memory technology. To overcome the data transfer overhead of virtual memory technology, in this paper we focus on analyzing the memory access pattern of GNNs with the goal of reducing the data transfer latency perceived by the user. A preliminary result of applying optimization techniques guided by our analysis results shows a 40% reduction in the execution time of a combination of training and testing.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionIn conventional multi-GPU configurations, the host manages execution, kernel launches, communication, and synchronization, incurring unnecessary overhead. To mitigate this, we present a CPU-free model that delegates control to the devices themselves, especially benefiting communication-intensive applications. Utilizing techniques such as persistent kernels, specialized thread blocks, and device-initiated communication, we create autonomous multi-GPU code that drastically reduces communication overhead. Our approach is demonstrated with popular solvers, including 2D/3D Jacobi stencils and Conjugate Gradient (CG). We are currently developing compiler technology for the model, applying it to a broader set of applications, and building its debugging/profiling tools.
Posters
Research Posters
TP
XO/EX
DescriptionResource disaggregation is prevalent in datacenters since it provides high resource utilization when compared to servers dedicated to either compute, memory, or storage. NVMe-over-Fabrics (NVMe-oF) is the standardized protocol used for accessing disaggregated storage over the network. Currently, the NVMe-oF specification lacks any semantics to prioritize I/O requests based on different application needs. Since applications have varying goals — latency-sensitive or throughput-critical I/O — we need to design efficient schemes in order to allow applications to specify the type of performance they wish to achieve. Furthermore, with additional tenants, we need to provide the respective specified performance optimizations that each application requests, regardless of congestion. This is a challenging problem, as the current NVMe specification lacks semantics to support multi-tenancy. Our research poster brings awareness to the ways in which we can bring multi-tenancy support to the NVMe-oF specification.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms and Frameworks
W
DescriptionTraditional graph-processing algorithms have been widely used in Graph Neural Networks (GNNs). Current approaches to graph processing in deep learning face two main problems. Firstly, easy-to-use deep learning libraries lack support for widely used graph processing algorithms and do not provide low-level APIs for building distributed graph processing algorithms. Secondly, existing graph processing libraries are not user-friendly for deep learning researchers. This paper presents an efficient and easy-to-use graph engine that incorporates distributed graph processing into deep-learning ecosystems. We develop a distributed graph storage system with an efficient batching technique to minimize communication overhead incurred by Remote Procedure Calls between computing nodes. We propose an optimized method for distributed computation of Single Source Personalized PageRank (SSPPR) using the Forward Push algorithm based on lock-free parallel maps. Experimental evaluations demonstrate significant improvement, with up to three orders of magnitude in SSPPR throughput, of our graph engine compared with the tensor-based implementation.
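The contribution here is the distributed, lock-free implementation; the single-node sketch below shows the textbook Forward Push routine for single-source personalized PageRank that such an engine parallelizes (the toy graph, teleport probability, and tolerance are illustrative):

```python
from collections import defaultdict, deque

def forward_push(adj, source, alpha=0.15, eps=1e-6):
    """Textbook Forward Push for single-source personalized PageRank.
    `adj` maps each node to a list of its out-neighbors."""
    reserve = defaultdict(float)                       # lower-bound PPR estimates
    residue = defaultdict(float, {source: 1.0})        # probability mass still to push
    active, queued = deque([source]), {source}
    while active:
        u = active.popleft()
        queued.discard(u)
        r, deg = residue[u], len(adj.get(u, ()))
        if deg == 0 or r < eps * deg:                  # not enough mass to push yet
            continue
        reserve[u] += alpha * r                        # settle an alpha fraction here
        residue[u] = 0.0
        share = (1.0 - alpha) * r / deg                # spread the rest to out-neighbors
        for v in adj[u]:
            residue[v] += share
            if v not in queued and residue[v] >= eps * max(len(adj.get(v, ())), 1):
                active.append(v)
                queued.add(v)
    return dict(reserve)

# Tiny directed graph for illustration.
g = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
print(forward_push(g, source=0))
```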
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn this work we perform one of the first in-depth, empirical comparisons of the Arm and RISC-V instruction sets. We compare a series of benchmarks compiled with GCC 9.2 and 12.2, targeting the scalar subsets of Arm's Armv8-A and RISC-V's RV64G. We analyze instruction counts, critical paths, and windowed critical paths to get an estimate of performance differences between the two instruction sets, determining where each has advantages and disadvantages. The results show the instruction sets are relatively closely matched on the metrics we evaluated for the benchmarks we considered, indicating that neither ISA has a large, inherent architectural advantage over the other.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionHigh-Performance Computing (HPC) centers demand a lot of power, and that demand continues to grow through the exascale era. This work establishes the need for a multi-tiered, feedback-driven power management framework that follows dynamic power objectives while maximizing job performance, highlighting the need to respond to external factors (e.g., power constraints) and internal factors (e.g., performance variation). We present a practical implementation of this framework on a real-world cluster in addition to conducting simulations for larger data centers. We accurately track a moving power target for demand response while reacting to incomplete or inaccurate prior knowledge about job power and performance properties. We demonstrate that online performance feedback from a job runtime enables a cluster power management policy to recover most of the performance degradation introduced by job-type misclassification.
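The paper's multi-tiered framework is considerably richer; as a toy sketch of the feedback idea, the snippet below proportionally rescales per-job power caps toward a moving cluster-level target, with the gain, floor, and simulated job draws all assumed for illustration:

```python
def adjust_caps(caps, measured_power, target_power, floor=100.0, gain=0.5):
    """Proportionally scale per-job power caps (watts) toward a moving
    cluster-level target, never dropping a job below a safety floor."""
    error = target_power - measured_power
    total = sum(caps.values())
    scale = 1.0 + gain * error / total
    return {job: max(floor, cap * scale) for job, cap in caps.items()}

# Toy demand-response event: the facility target drops from 1000 W to 800 W.
caps = {"job_a": 400.0, "job_b": 350.0, "job_c": 300.0}
measured = 1000.0
for target in (1000.0, 800.0, 800.0, 800.0):
    caps = adjust_caps(caps, measured, target)
    measured = sum(caps.values()) * 0.95      # pretend jobs draw 95% of their caps
    print(target, {j: round(c) for j, c in caps.items()}, round(measured))
```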
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionThe MPI 4.0 standard introduced the concept of partitioned point-to-point communication. One facet that may help in encouraging application developers to use this new concept in their programs is the availability of proper tool support in a timely manner. We therefore propose nine new events extending the OTF2 event model to accurately represent the runtime behavior of partitioned point-to-point communication. We then demonstrate the suitability of these extensions with three different use cases in the context of performance analysis. In particular, we showcase a prototype implementation of an extended waitstate analysis in the Scalasca trace analyzer, and discuss further potential use cases in the realm of trace visualization and simulation.
Workshop
Quantum Computing
Software Engineering
W
DescriptionA crucial step in compiling a quantum algorithm involves addressing a layout problem to meet the device's layout constraints. The Qubit Mapping and Routing (QMR) problem aims to minimize the number of SWAP gates added to the circuit to fulfill NISQ hardware's connectivity constraints. Although this problem is NP-hard, finding solutions quickly is vital as it is part of the compilation process.
In this research, we present the QMR problem as a Quadratic Unconstrained Binary Optimization problem (QUBO) and utilize specialized hardware, the Fujitsu Digital Annealer, for faster solving. Experiments on various benchmarks are conducted, comparing our approach to popular methods like Qiskit and tket. Remarkably, our method achieves the optimal solutions for almost all instances in the QUEKO benchmark, outperforming other solvers significantly. Furthermore, we demonstrate our approach's superior performance in various instances when compared to other application-specific quantum circuits.
Workshop
Education
State of the Practice
W
DescriptionThis work presents an overview of an NSF Research Experience for Undergraduate Site on Trust and Reproducibility of Intelligent Computation, delivered by faculty and graduate students in the Kahlert School of Computing at University of Utah. The chosen themes bring together several concerns for the future in producing computational results that can be trusted: secure, reproducible, based on sound algorithmic foundations, and developed in the context of ethical considerations. The research areas represented by student projects include machine learning, high-performance computing, algorithms and applications, computer security, data science, and human-centered computing. In the first four weeks of the program, the entire student cohort spent their mornings in lessons from experts in these crosscutting topics, and used one-of-a-kind research platforms operated by the University of Utah, namely NSF-funded CloudLab and POWDER facilities. This program can serve as a model for preparing a future workforce to integrate ML into trustworthy reproducible applications.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionIn the high performance computing (HPC) domain, performance variability is a major scalability issue for parallel computing applications with heavy synchronization and communication. We present an experimental performance analysis of OpenMP benchmarks regarding the variation of execution time, and determine the potential factors causing performance variability.
Our work offers some understanding of performance distributions and directions for future work on how to mitigate variability for OpenMP-based applications. Two representative OpenMP benchmarks from the EPCC OpenMP micro-benchmark suite and BabelStream are run across two x86 multicore platforms featuring up to 256 threads. From the obtained results, we characterize and explain the execution time variability as a function of thread-pinning, simultaneous multithreading (SMT) and core frequency variation.
Workshop
Accelerators
Applications
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWith the advent of GPUs in parallel computing, several languages, tools, and compilers are being developed. Many impactful applications can benefit from the performance capabilities these GPUs provide, but moving large, complex code bases to GPU execution often poses many hurdles and growing pains as developers adapt to unfamiliar programming models and interface with increasingly complex, but powerful, hardware. Our work discusses experiences using OpenACC to bring GPU acceleration to MURaM, a state-of-the-art solar physics application, including various problems we have explored and overcome to bring better performance portability to the code within the limitations of the programming model. We then provide scaling results and findings from transitioning to current-generation GPU architectures, with strong and weak scaling on up to 512 NVIDIA A100 GPUs, observing that one A100 GPU is comparable to 90-100 CPU cores and that GPU runs scale much further than the CPU runs are capable of.
Workshop
Data Analysis, Visualization, and Storage
Data Compression
W
DescriptionToday’s scientific simulations generate exceptionally large volumes of data, challenging the capacities of available I/O bandwidth and storage space. This necessitates a substantial reduction in data volume, for which error-bounded lossy compression has emerged as a highly effective strategy. A crucial metric for assessing the efficacy of lossy compression is visualization. Despite extensive research on the impact of compression on visualization, there is a notable gap in the literature concerning the effects of compression on the visualization of Adaptive Mesh Refinement (AMR) data. AMR has proven to be a potent solution for the rising computational intensity and the explosive growth in data volume. However, the hierarchical and multi-resolution characteristics of AMR data introduce unique challenges to its visualization, and these challenges are further compounded when data compression comes into play. This article studies the intricacies of how data compression influences, and introduces novel challenges to, the visualization of AMR data.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionParallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the above-mentioned problem, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Workshop
Distributed Computing
Security
W
DescriptionBoth 3rd generation Xeon scalable processors and Gramine 1.0, which potentially improves the performance of Intel SGX, were released in 2021. In this paper, we provide the first performance analysis of HPC workloads with Gramine and SGX on 3rd generation Xeon scalable processors. Our analysis starts with some microbenchmarks and is then extended to various HPC workloads. Our experimental results show that Gramine+SGX incurs a small performance overhead (4-17%) for both compute-intensive and memory-bandwidth-sensitive workloads but a larger performance overhead (up to 170%) for a memory-latency-sensitive workload. In addition, we show that the combination of Gramine and a 3rd generation Xeon scalable processor shows a slowdown of 1.5x on average (up to 4.4x) for many HPC workloads. This number is an order of magnitude smaller than that reported in previous work using the combination of the former generation SGX toolchain and processor.
Paper
ANT-MOC: Scalable Neutral Particle Transport Using 3D Method of Characteristics on Multi-GPU Systems
Accelerators
Applications
Modeling and Simulation
TP
Best Paper Finalist
Best Student Paper Finalist
DescriptionThe Method of Characteristics (MOC) for solving the Neutron Transport Equation (NTE) is the core of full-core simulation for reactors. High resolution is enabled by discretizing the NTE through massive numbers of tracks that traverse the 3D reactor geometry. However, 3D full-core simulation is prohibitively expensive because of the high memory consumption and the severe load imbalance. To deal with these challenges, we develop ANT-MOC. Specifically, we build a performance model for memory footprint, computation, and communication, based on which a track management strategy is proposed to overcome the resolution bottlenecks caused by limited GPU memory. Furthermore, we implement a novel multi-level load mapping strategy to ensure load balancing among nodes, GPUs, and CUs. ANT-MOC enables a 3D full-core reactor simulation with 100 billion tracks on 16,000 GPUs, with 70.69% and 89.38% parallel efficiency for strong scalability and weak scalability, respectively.
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionPerformance tuning, software/hardware co-design, and job scheduling are among the many tasks that rely on models to predict application performance. We propose and evaluate low-rank tensor decomposition for modeling application performance. We discretize the input and configuration domains of an application using regular grids. Application execution times mapped within grid-cells are averaged and represented by tensor elements. We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors. We further show that this decomposition enables accurate extrapolation of unobserved regions of an application's parameter space. We then employ tensor completion to optimize a CP decomposition given a sparse set of observed execution times. We consider alternative piecewise/grid-based models and supervised learning models for six applications and demonstrate that CP decomposition optimized using tensor completion offers higher prediction accuracy and memory-efficiency for high-dimensional performance modeling.
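The paper optimizes the CP model with tensor completion over sparsely observed execution times; the dense numpy sketch below only shows the basic CP decomposition via alternating least squares that such a model builds on, with the tensor sizes, rank, and synthetic "execution time" data chosen for illustration:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: row (j*K + k) equals B[j, :] * C[k, :]."""
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def cp_als(T, rank, iters=100):
    """Plain alternating least squares for a rank-`rank` CP model of a 3-way tensor."""
    I, J, K = T.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((d, rank)) for d in (I, J, K))
    T0 = T.reshape(I, -1)                         # mode-0 unfolding
    T1 = np.moveaxis(T, 1, 0).reshape(J, -1)      # mode-1 unfolding
    T2 = np.moveaxis(T, 2, 0).reshape(K, -1)      # mode-2 unfolding
    for _ in range(iters):
        A = T0 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T1 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T2 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

rng = np.random.default_rng(1)
# Synthetic low-rank "execution time" tensor, e.g., nodes x ranks-per-node x input size.
true = [rng.random((d, 3)) + 0.1 for d in (8, 10, 12)]
T = np.einsum("ir,jr,kr->ijk", *true)
A, B, C = cp_als(T, rank=3)
approx = np.einsum("ir,jr,kr->ijk", A, B, C)
print("relative error:", np.linalg.norm(T - approx) / np.linalg.norm(T))
```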
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThis BoF provides a forum for Fortran developers to engage with its modern programming features. Fortran continues to play a crucial role in numerous legacy applications, but with features introduced in recent standards, the language also supports modern programming practices and high-performance computing. As Fortran 2023 approaches, this BoF brings together developers from various domains to share experiences and explore the language's evolving capabilities. After some brief panelist presentations, the session will focus on an interactive discussion where audience members will be encouraged to share their own experiences and ask questions of our panelists.
Posters
Research Posters
TP
XO/EX
DescriptionType Ia supernovae are highly luminous thermonuclear explosions of white dwarfs that serve as standardizable distance markers for investigating the accelerating expansion of our Universe. Most existing supernova simulation codes are designed to run only on homogeneous CPU-only systems and do not take advantage of the increasing shift toward heterogeneous architectures in HPC. To address this, we present Ares, the first performance-portable, massively parallel code for simulating thermonuclear burn fronts. By creating multi-physics modules using the Kokkos and Parthenon frameworks, we are able to scale supernova simulations to distributed HPC clusters running on any of the CUDA, HIP, SYCL, HPX, OpenMP, and serial backends. We evaluate our application by conducting weak and strong scaling studies on both CPU and GPU clusters, showing the efficiency of our method for a diverse set of targets.
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThis BoF brings together the Arm HPC community to discuss experiences and lessons learned in delivering and operating Arm-based HPC systems. The maturity of the Arm HPC ecosystem has been discussed extensively, especially for the upper part of the stack (compilers, libraries, applications). This BoF instead turns to the other side of the coin: the administration and management of such systems. Primed by a short opening session from well-recognized experts in the community, the host and panel will engage attendees to share and ask probing questions. Audience participation is strongly encouraged.
Workshop
Fault Handling and Tolerance
W
DescriptionHigh-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for reproducibility studies faces the challenge of high-frequency checkpointing across thousands of processes: the resulting runtime overhead affects application performance, and intermediate results can change when interleaving is introduced during floating-point calculations. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated by scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed with identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed with the NWChem software. Experimental results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30x and up to 211x compared to the default NWChem checkpointing approach.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionAs modern High-Performance Computing (HPC) systems reach exascale performance, their power consumption becomes a serious threat to environmental and energy sustainability. Efficient power management in HPC systems is crucial for optimizing workload management, reducing operational costs, and promoting environmental sustainability. Accurate prediction of job power consumption plays an important role in achieving these goals. We apply a technique combining Machine Learning (ML) algorithms with Natural Language Processing (NLP) tools to predict job power consumption. The solution predicts a job's maximum and average power consumption per node, leveraging only information that is available at the time of job submission. The prediction is performed in an online fashion, and we validate the approach using batch system logs extracted from the supercomputer Fugaku. The experimental evaluation shows promising results, outperforming classical techniques while obtaining an R2 score of more than 0.53 for both of our prediction tasks.
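As a rough illustration of the kind of pipeline the abstract describes (ML combined with NLP on submission-time information), the sketch below trains a regressor on TF-IDF features of job-script text using scikit-learn; the data are synthetic and this is not the authors' model or the Fugaku logs.

```python
# Hypothetical sketch of predicting per-node job power from submission-time
# text (job name / script), in the spirit of the ML+NLP approach; uses
# scikit-learn and synthetic data, not the authors' pipeline or Fugaku logs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

job_scripts = [
    "lammps md run 128 nodes gpu",
    "python train resnet50 imagenet",
    "wrf weather forecast ensemble",
    "bwa mem genome alignment",
]
avg_power_per_node = [310.0, 420.0, 280.0, 150.0]  # watts, synthetic labels

model = make_pipeline(TfidfVectorizer(),
                      RandomForestRegressor(n_estimators=100, random_state=0))
model.fit(job_scripts, avg_power_per_node)
print(model.predict(["python train transformer language model"]))
```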
Workshop
Education
Heterogeneous Computing
Reproducibility
State of the Practice
W
DescriptionTechnological advances have increased the importance of teaching the fundamentals of robotics and autonomous systems, a subject that relies on strong hands-on practical experimentation. National Science Foundation (NSF)-supported testbeds have opened the doors to experimentation and support for the next era of computing platforms and large-scale cloud research.
We present an open-source educational module that makes this education accessible, aiming to prepare learners for technological career paths. The module is designed to bring hands-on sessions to students and help them attain knowledge in a comprehensive manner. Specifically, we present AutoLearn: Learning in the Edge to Cloud Continuum, an educational module that integrates a collection of educational artifacts built around an open-source, small-scale self-driving platform and leverages the Chameleon Cloud testbed to teach cloud computing concepts, edge device technology, and artificial-intelligence-driven applications.
Paper
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
TP
DescriptionIn a parallel and distributed application, a mapping is a selection of a processor for each computation or task and memories for the data collections that each task accesses. Finding high-performance mappings is challenging, particularly on heterogeneous hardware with multiple choices for processors and memories. We show that fast mappings are sensitive to the machine, application, and input. Porting to a new machine, modifying the application, or using a different input size may necessitate re-tuning the mapping to maintain the best possible performance.
We present AutoMap, a system that automatically tunes the mapping to the hardware used and finds fast mappings without user intervention or code modification. In contrast, hand-written mappings often require days of experimentation. AutoMap utilizes a novel constrained coordinate-wise descent search algorithm that balances the trade-off between running computations quickly and minimizing data movement. AutoMap discovers mappings up to 2.41x faster than custom, hand-written mappers.
Workshop
Applications
State of the Practice
W
DescriptionBackground: Cancer is the second leading cause of death in the United States [1]. Automatic characterization of malignant disease is an important clinical need to facilitate early detection and treatment of cancer [2]. Advances in machine learning (ML) and deep learning (DL) have shown significant promise for radiological and oncological applications [3]. Radiomic analysis extracts quantitative features from radiologic data about a cancerous tumor [4]. DL methods require large training datasets with sufficiently annotated images, which are difficult to obtain for radiological applications. The objective of this study was to develop a deep semi-supervised transfer learning approach for automated whole-body tumor segmentation and prognosis on positron emission tomography (PET)/computed tomography (CT) scans using limited annotations (Fig. 1a).
Methods: Five datasets consisting of 1,019 prostate, lung, melanoma, lymphoma, head and neck, and breast cancer patients with prostate-specific membrane antigen (PSMA) and fluorodeoxyglucose (FDG) PET/CT scans were used in this study (Table 1). A nnUnet backbone was cross-validated on the tumor segmentation task via a 5-fold cross-validation. Predicted segmentations were iteratively improved using radiomic analysis. Transfer learning generalized the segmentation task across PSMA and FDG PET/CT. Segmentation accuracy was evaluated on true positive rate (TPR), positive predictive value (PPV), Dice similarity coefficient (DSC), false discovery rate (FDR), true negative rate (TNR), and negative predictive value (NPV). Imaging measures quantifying molecular tumor burden and uptake were extracted from the predicted segmentations. A risk stratification model was developed for prostate cancer by combining the extracted imaging measures and was evaluated on follow-up prostate-specific antigen (PSA) levels. A risk stratification model was developed for head and neck cancer patients by combining imaging measures and American Joint Committee on Cancer (AJCC) staging and was evaluated via Kaplan-Meier survival analysis. A prognostic model was developed to predict pathological response of breast cancer patients to neoadjuvant chemotherapy using imaging measures from pre-therapy and post-therapy PET/CT scans. Prognostic models were evaluated on overall accuracy and area under the receiver operating characteristic (AUROC) curve. Statistically significant differences were inferred using a Wilcoxon rank-sum test.
Results: Accuracy metrics and illustrative examples of predicted tumor segmentations are shown in Table 2 and Fig. 1b. The risk stratification model yielded an overall accuracy of 0.83 and an AUROC of 0.86 in stratifying prostate cancer patients (Fig. 1c). Median follow-up PSA levels in the low-intermediate and high risk groups were 1.19 ng/mL and 53.20 ng/mL (P < 0.05). Head and neck cancer patients were stratified into low, intermediate, and high risk groups with significantly different Kaplan-Meier survival curves by the log-rank test (Fig. 1d). A prognostic model using imaging measures from pre-therapy scans predicted pathological complete response (pCR) in breast cancer patients with an accuracy of 0.72 and an AUROC of 0.72. The model using imaging measures from both pre-therapy and post-therapy scans predicted pCR in breast cancer patients with an accuracy of 0.84 and an AUROC of 0.76.
Conclusion: A deep semi-supervised transfer learning approach was developed and demonstrated accurate tumor segmentation, quantification, and prognosis on PET/CT of patients across six cancer types.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionWe introduce a novel energy-efficient job scheduling approach for High-Performance Computing (HPC) environments. Its primary objective is to bridge the gap between research and production in energy-efficient scheduling models for HPC. The proposed architecture and program decouple the scheduling heuristics from the HPC scheduler SLURM into a Python application, enabling adaptability for production setups. The implementation demonstrates an 11% potential energy saving on the High-Performance Conjugate Gradients (HPCG) benchmark, highlighting the practicality of the approach on a single-node HPC cluster. This work serves as a foundation for integrating research in this area into production, offering a realistic example of energy-efficient HPC in practice. It also opens possibilities for more advanced applications, such as automatically scheduling jobs during low-cost and renewable-energy periods, as already done by companies employing HPC. This contribution showcases a practical, energy-efficient solution for HPC job scheduling and identifies potential for future enhancements in this area.
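For the follow-on use case mentioned above (shifting jobs into low-cost or renewable-energy periods), the snippet below is a hypothetical illustration of picking the cheapest start window from a forecast; it is not the authors' SLURM-integrated scheduler.

```python
# Hypothetical illustration of delaying a job to the cheapest window in a
# carbon-intensity/price forecast; not the authors' SLURM-integrated scheduler.
def best_start_slot(forecast, job_slots):
    """Return the start index minimizing total cost over `job_slots` slots."""
    costs = [sum(forecast[i:i + job_slots])
             for i in range(len(forecast) - job_slots + 1)]
    return min(range(len(costs)), key=costs.__getitem__)

# Hourly grid carbon intensity (gCO2/kWh) for the next 12 hours (made up).
forecast = [430, 410, 390, 350, 300, 260, 240, 250, 320, 380, 420, 450]
start = best_start_slot(forecast, job_slots=3)
print(f"schedule the 3-hour job at hour {start}")  # -> hour 5
```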
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionWhile considerable research has been directed at automatic parallelization for shared-memory platforms, little progress has been made in automatic parallelization schemes for distributed-memory systems. We introduce an innovative approach to automatically produce distributed-memory parallel code for an important sub-class of affine tensor computations common to Coupled Cluster (CC) electronic structure methods, neuro-imaging applications, and deep learning models.
We propose a novel systematic approach to modeling the relations and trade-offs of mapping computations and data onto multi-dimensional grids of homogeneous nodes. Our formulation explores the space of computation and data distributions across processor grids. Tensor programs are modeled as a non-linear symbolic formulation accounting for the volume of data communication and per-node capacity constraints induced under specific mappings. Solutions are found, iteratively, using the Z3 SMT solver, and used to automatically generate efficient MPI code. Our evaluation demonstrates the effectiveness of our approach over Distributed-Memory Pluto and the Cyclops Tensor Framework.
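As a flavor of how such a mapping problem can be posed to an SMT solver, the following is a tiny, hypothetical z3py sketch that selects a 2D processor grid under node-count and per-node memory constraints while minimizing a simple communication proxy; it illustrates the style of formulation only and is not the paper's actual model.

```python
# Tiny, hypothetical z3py sketch of choosing a 2D processor grid p1 x p2 for a
# block-distributed N x N matrix, minimizing a communication-volume proxy under
# node-count and per-node memory constraints; not the paper's actual model.
from z3 import Ints, Optimize, sat

N, P, MEM_WORDS = 16384, 64, 16 * 1024 * 1024   # matrix dim, nodes, per-node capacity
p1, p2 = Ints("p1 p2")
opt = Optimize()
opt.add(p1 >= 1, p2 >= 1, p1 * p2 == P)
opt.add(N * N <= MEM_WORDS * p1 * p2)           # local tiles must fit in memory
# Crude proxy for boundary-exchange volume of the block distribution.
opt.minimize(N * p1 + N * p2)
if opt.check() == sat:
    m = opt.model()
    print("grid:", m[p1], "x", m[p2])
```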
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn this paper, we propose and evaluate several optimized implementations of the general matrix multiplication (Gemm) on two different RISC-V architecture cores implementing the RISC-V vector extension (RVV): C906 and C910 from T-HEAD. Specifically, we address the performance portability problem across these processor cores by means of an automatic assembly code generator, written in Python, capable of emitting RVV code for high performance computing (HPC), with a variety of combinations of specific and general optimizations.
Our experimental results using a number of automatically generated micro-kernels for Gemm on both RISC-V architectures reveal a different impact for each optimization depending on the target architecture, and highlight the importance of automatically generating HPC RVV code to achieve performance portability while reducing developer effort. In addition, these optimizations yield important performance gains with respect to a state-of-the-art tuned BLAS library (OpenBLAS), reaching 3x and 1.3x speed-ups for the C910 and C906, respectively.
Workshop
Performance Optimization
W
DescriptionThe rapid development of machine learning (ML) has prompted demand for low-precision arithmetic hardware that can deliver faster computing speed. Weather simulation applications typically exhibit high sensitivity to small perturbations of the input data, but this inherent uncertainty paves the way for mixed-precision computing (MPC) by trading accuracy for performance. Additional challenges in balancing lower computational cost against accuracy requirements need to be addressed before MPC can be applied successfully to weather modeling applications. Determining an acceptable precision allocation for variables involves navigating an exponential search space of mixed-precision configurations. We propose a mixed-precision code tuning framework that automatically searches for suitable precision configurations for weather modeling applications using black-box optimization algorithms. The resulting configurations achieve up to a 30% performance gain while staying within the tolerance level, offering a workflow that facilitates the identification of variables sensitive to precision change.
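A schematic of the search problem, with a synthetic cost/error model standing in for real model runs, might look like the following; it is only meant to illustrate precision-configuration search under a tolerance, not the framework proposed in the paper.

```python
# Schematic, hypothetical search over per-variable precision configurations
# with an accuracy tolerance; the cost and error models below are stand-ins,
# not the tuning framework described in the paper.
import itertools

variables = ["pressure", "humidity", "wind_u", "wind_v"]
precisions = ["fp64", "fp32"]

def run_with(config):
    """Pretend to run the model: lower precision is cheaper but less accurate."""
    n_fp32 = sum(1 for p in config.values() if p == "fp32")
    runtime = 100.0 - 8.0 * n_fp32          # seconds (synthetic)
    error = 1e-12 * 10 ** n_fp32            # vs. an fp64 baseline (synthetic)
    return runtime, error

TOLERANCE = 1e-9
best = None
for combo in itertools.product(precisions, repeat=len(variables)):
    config = dict(zip(variables, combo))
    runtime, error = run_with(config)
    if error <= TOLERANCE and (best is None or runtime < best[0]):
        best = (runtime, config)
print(best)
```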
Posters
Research Posters
TP
XO/EX
DescriptionThe increasing demand for processing power on resource-constrained edge devices necessitates efficient techniques for optimizing High Performance Computing (HPC) applications. We propose HPEE (HPC Parameter Exploration on Edge), a novel approach that formulates the parameter search problem as a pure-exploration multi-armed bandit (MAB) problem. By efficiently exploring the search space with the MAB framework, we achieve significant performance improvements while respecting the limited computational resources of edge devices. Experimental results based on an HPC application demonstrate the effectiveness of our approach in optimizing parameter search on edge devices, offering a promising solution for enhancing HPC performance in resource-constrained environments.
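The sketch below shows one common pure-exploration strategy (successive halving) over a small configuration space with noisy synthetic measurements; it is illustrative only and not the HPEE implementation.

```python
# Toy, hypothetical successive-halving sketch of pure-exploration bandit
# search over runtime configurations on an edge device; not the HPEE code.
import random

arms = [{"threads": t, "tile": b} for t in (1, 2, 4) for b in (16, 32, 64)]

def measure(arm):
    """One noisy runtime sample (seconds) for a configuration (synthetic)."""
    base = 10.0 / arm["threads"] + 0.02 * arm["tile"]
    return base + random.gauss(0.0, 0.1)

survivors, budget = list(arms), 2
while len(survivors) > 1:
    scored = [(sum(measure(a) for _ in range(budget)) / budget, a)
              for a in survivors]
    scored.sort(key=lambda s: s[0])
    survivors = [a for _, a in scored[: max(1, len(survivors) // 2)]]  # keep faster half
    budget *= 2                                                        # spend more on survivors
print("selected configuration:", survivors[0])
```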
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionApache TVM (Tensor Virtual Machine), an open source machine learning compiler framework designed to optimize computations across various hardware platforms, provides an opportunity to improve the performance of dense matrix factorizations such as LU (Lower Upper) decomposition and Cholesky decomposition on GPUs, FPGAs, ASICs, and AI accelerators. In this paper, we propose a new TVM autotuning framework using Bayesian Optimization and use the TVM tensor expression language to implement linear algebra kernels such as LU, Cholesky, and 3mm. We use these scientific computation kernels to evaluate the effectiveness of our methods on a GPU cluster, called Swing, at Argonne National Laboratory. We compare the proposed autotuning framework with the TVM autotuning framework AutoTVM with four tuners and find that our framework outperforms AutoTVM in most cases.
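For readers unfamiliar with the approach, the following is a compact, hypothetical Bayesian-optimization autotuning loop using scikit-optimize over two kernel knobs with a synthetic cost function; it is not the authors' TVM integration.

```python
# Compact, hypothetical Bayesian-optimization autotuning loop using
# scikit-optimize over two kernel knobs; the cost function is synthetic and
# this is not the authors' TVM-integrated framework.
from skopt import gp_minimize
from skopt.space import Integer

def measured_runtime(params):
    """Pretend to build and time a kernel for (tile_size, unroll_factor)."""
    tile, unroll = params
    return (tile - 48) ** 2 / 100.0 + (unroll - 4) ** 2 + 1.0  # synthetic ms

space = [Integer(8, 128, name="tile"), Integer(1, 8, name="unroll")]
result = gp_minimize(measured_runtime, space, n_calls=25, random_state=0)
print("best config:", result.x, "predicted runtime:", result.fun)
```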
Posters
Research Posters
TP
XO/EX
DescriptionDistributed large-model inference still faces a dilemma in balancing latency against throughput, or rather cost against effect. Tensor parallelism, while capable of optimizing latency, entails substantial expense. Conversely, pipeline parallelism excels in throughput but falls short in minimizing execution time.
To address this challenge, we introduce a novel solution: interleaved parallelism. This approach interleaves computation and communication across requests. Our proposed runtime system harnesses GPU scheduling techniques to overlap communication and computation kernels, thereby enabling this new form of parallelism for distributed large-model inference. Extensive evaluations show that our proposal outperforms existing parallelism approaches across models and devices, delivering the best latency and throughput in most cases.
Workshop
Programming Frameworks and System Software
W
DescriptionDuring software development, many aspects of the system and user state can change. Significant time can be spent tracking down the causes of these differences, rather than focusing on the main task of software development. This paper describes a tool to record the state at build-time and at runtime of an application to more easily investigate the cause(s) of differences in behavior. The added logging enables better software quality assurance by tracking code changes and their effects on runtime behavior. At a minimum, this tool only requires prepending one command at build-time and another at runtime. Project-level configurations can be set to enable the collection of additional information.
Workshop
Artificial Intelligence/Machine Learning
Distributed Computing
W
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionSimulations of Lattice Quantum Chromodynamics (LQCD) are an important application (consuming a double-digit percentage of cycles) on major High Performance Computing (HPC) installations, including systems at and near the top of the TOP500 list. In the rapidly changing hardware landscape of HPC, dedicating workforce to optimizing simulation software for every architecture becomes a sustainability issue.
In this work, we explore the feasibility of using performance-portable parallel code for an important LQCD kernel. Combining the Kokkos C++ Performance Portability EcoSystem with MPI allows us to scale on massively parallel machines while still targeting a multitude of different architectures with the same simple code. We report benchmarking results for a range of currently deployed and recently introduced systems, including AMD EPYC 7742, AMD MI250, Fujitsu A64FX, NVIDIA A100, and NVIDIA H100 components, with mostly encouraging results.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionTransformer models suffer from high computational complexity. The Habana GAUDI architecture offers a promising solution to this problem. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. First, we provide a performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Second, we explore strategies to optimize MME and TPC utilization, offering practical insights to enhance computational efficiency. Third, we evaluate the performance of Transformers on GAUDI, particularly in handling long sequences, and uncover performance bottlenecks. Last, we evaluate the end-to-end performance of two Transformer-based large language models (LLMs) on GAUDI. The contributions of this work encompass practical insights for practitioners and researchers alike. We delve into GAUDI's capabilities for Transformers through systematic profiling, analysis, and optimization exploration.
Workshop
Education
State of the Practice
W
Tutorial
Cloud Computing
Software Engineering
TUT
DescriptionHigh Performance Computing in the cloud has grown significantly over the last five years. Weather, computational fluid dynamics (CFD), genomic analysis, and other workloads leverage the elasticity and broad compute choices of the cloud to innovate faster and deliver results sooner. The large choice of compute, storage, and network options and the dynamic nature of the cloud can make the first experience a daunting proposition. Cloud technologies also provide new capabilities to scientists, engineers, and HPC specialists; however, how to use them may not be immediately clear.
This tutorial provides intermediate and advanced content on running and managing HPC in the cloud. It is organized as four series of progressive lectures and labs that provide a hands-on learning experience. It starts with a primer on cloud foundations and how they map to common HPC concepts, dives deeper into core cloud components, and presents best practices for running HPC in the cloud.
This tutorial uses a combination of lectures and hands-on labs on provided temporary Amazon Web Services (AWS) accounts to provide both conceptual and hands-on learning.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionMachine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets to develop and evaluate models. Common practice is to assign these subsets randomly. Although this approach is fast, it only measures a model's capacity to interpolate, and the resulting testing errors may be overly optimistic for out-of-scope data; thus, there is a growing need to easily measure performance on extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. This poster focuses on use cases within cheminformatics. However, astartes operates on arbitrary vectors, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
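To illustrate the idea of more challenging, extrapolative splits (without using astartes' own API), the snippet below holds out an entire KMeans cluster as the test set so the model must extrapolate; the data are synthetic.

```python
# Generic, illustrative cluster-based extrapolative split using scikit-learn
# KMeans: a whole cluster is held out for testing so the model must
# extrapolate; this sketch is not the astartes package's own API.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                     # e.g., molecular descriptors
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

held_out_cluster = 0                              # hold out one whole cluster
test_idx = np.where(labels == held_out_cluster)[0]
train_idx = np.where(labels != held_out_cluster)[0]
print(len(train_idx), "train /", len(test_idx), "test (out-of-cluster)")
```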
Tutorial
Applications
Software Engineering
TUT
DescriptionProducing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality. Code coupling requires aggregate team interactions including integration of software processes and practices. These challenges demand large investments in scientific software development and improved practices. Focusing on improved developer productivity and software sustainability is both urgent and essential.
Attendees will learn about practices, processes, and tools to improve the productivity of those who develop CSE software, increase the sustainability of software artifacts, and enhance trustworthiness in their use. We will focus on aspects of scientific software development that are not adequately addressed by resources developed for industrial software engineering. Topics include the design, refactoring, and testing of complex scientific software systems; collaborative software development; and software packaging. The second half of this full-day tutorial will focus on reproducibility, and why and how to keep a lab notebook for computationally-based research.
Workshop
Quantum Computing
Software Engineering
W
DescriptionThe classical simulation of quantum computers is in general a computationally hard problem. To emulate the behavior of realistic devices, it is sufficient to sample bitstrings from circuits. Recently, Ref. [5] introduced the so-called gate-by-gate sampling algorithm to sample bitstrings and showed it to be computationally favorable in many cases. Here we present bgls, a Python package which implements this sampling algorithm. bgls has native support for several states and is highly flexible for use with additional states. We show how to install and use bgls, discuss optimizations in the algorithm, and demonstrate its utility on several problems.
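For context, the snippet below shows the naive alternative that gate-by-gate sampling improves upon: sampling bitstrings directly from a full final state vector with NumPy; it does not use the bgls API.

```python
# Small, generic NumPy example of sampling bitstrings from a final state
# vector (the naive alternative to gate-by-gate sampling); it does not call
# the bgls API.
import numpy as np

def sample_bitstrings(state, shots, seed=0):
    """Sample measurement outcomes from an n-qubit state vector."""
    n_qubits = int(np.log2(state.size))
    probs = np.abs(state) ** 2
    rng = np.random.default_rng(seed)
    outcomes = rng.choice(state.size, size=shots, p=probs)
    return [format(int(o), f"0{n_qubits}b") for o in outcomes]

# |Phi+> Bell state: only '00' and '11' should appear.
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
print(sample_bitstrings(bell, shots=8))
```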
ACM Gordon Bell Finalist
Awards
TP
DescriptionReal-time, 30-second-refresh numerical weather prediction (NWP) was performed with exclusive use of 11,580 nodes (~7%) of the supercomputer Fugaku during the Tokyo Olympics and Paralympics in 2021. A total of 75,248 forecasts were disseminated over the one-month period, mostly stably, with a time-to-solution of less than 3 minutes for each 30-minute forecast. Japan's Big Data Assimilation (BDA) project developed this novel NWP system for precise prediction of hazardous rains, contributing to addressing the global climate crisis. Compared with typical 1-hour-refresh systems, the BDA system offered a two-orders-of-magnitude increase in problem size and revealed the effectiveness of 30-second refresh for highly nonlinear, rapidly evolving convective rains. To achieve the required time-to-solution for real-time 30-second refresh with high accuracy, the core BDA software incorporated single precision and enhanced parallel I/O with a properly selected configuration of 1,000 ensemble members and a 500-m-mesh weather model. The massively parallel, I/O-intensive real-time BDA computation demonstrated a promising future direction.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionDynamic graph networks are widely used for learning time-evolving graphs, but prior approaches to training these networks are inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced by carefully scheduling the workload, and that the execution order of operators in GNNs can be adjusted without hurting training convergence.
We propose a system called BLAD to consider the above factors, comprising a two-level load scheduler and an overlap-aware topology manager. The scheduler allocates each snapshot group to a GPU, alleviating cross-GPU communication.
The snapshots in a group are then carefully allocated to processes on a GPU, enabling overlap of compute-intensive NN operators and memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize this overlap. Experiments show that BLAD achieves a 27.2% speedup in training time on average, without affecting final accuracy, compared to state-of-the-art solutions.
Invited Talk
Applications
Biology
Medicine
TP
DescriptionNeuroscience has become a highly interdisciplinary research field, including among others purely experimental studies, applied technology development, mathematical theory, computational models and simulations, AI, visualization, and data analysis. However, neuroscience is relatively new to the use of High Performance Computing. Within the European flagship Human Brain Project, scientists from all around Europe have made substantial progress in consolidating the computational requirements and usage patterns of this heterogeneous field. In parallel with the evolution of the European HPC landscape, neuroscience has also helped co-design federated access to HPC, cloud, and data resources through the ICEI project, in collaboration with FENIX-RI, a European effort to provide federated access to some of the largest HPC centers in Europe.
In this talk, I will provide a general overview of the evolving relationships between neuroscience and HPC. I will also present some examples of scientific highlights which have been made possible by this interaction. Finally, I will provide a perspective of how neuroscience can contribute to future technology co-design keeping in focus societal impact. I will complement this talk with my personal story and international experiences.
Paper
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
TP
DescriptionMosaic Flow is a novel domain decomposition method designed to scale physics-informed neural PDE solvers to large domains. Its unique approach leverages pre-trained networks on small domains to solve partial differential equations on large domains purely through inference, resulting in high reusability. This paper presents an end-to-end parallelization of Mosaic Flow, combining data parallel training and domain parallelism for inference on large-scale problems. By optimizing the network architecture and data parallel training, we significantly reduce the training time for learning the Laplacian operator to minutes on 32 GPUs. Moreover, our distributed domain decomposition algorithm enables scalable inferences for solving the Laplace equation on domains 4096x larger than the training domain, demonstrating strong scaling while maintaining accuracy on 32 GPUs. The reusability of Mosaic Flow, combined with the improved performance achieved through the distributed-memory algorithms, makes it a promising tool for modeling complex physical phenomena and accelerating scientific discovery.
Workshop
Education
State of the Practice
W
DescriptionThe convergence of quantum technologies and high-performance computing offers unique opportunities for research and algorithm development, demanding a skilled workforce to harness the quantum systems' potential. In this lightning talk, we address the growing need to train experts in quantum computing and explore the challenges in training these individuals in quantum computing, including the abstract nature of quantum theory, or the focus on specific frameworks. To overcome these obstacles, we propose self-guided learning resources that offer interactive learning experiences and practical framework-independent experimentation for different target audiences.
Paper
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionMany real-world computations involve sparse data structures in the form of sparse matrices. A common strategy for optimizing sparse matrix operations is to reorder a matrix to improve data locality. However, it is not always clear whether reordering will provide benefits over the original ordering, as its effectiveness depends on several factors, such as the structural features of the matrix, the reordering algorithm, and the hardware used. This paper aims to establish the relationship between matrix reordering algorithms and the performance of sparse matrix operations. We thoroughly evaluate six different matrix reordering algorithms on 490 matrices across eight multicore architectures, focusing on the commonly used sparse matrix-vector multiplication (SpMV) kernel. We find that reordering based on graph partitioning provides better SpMV performance than the alternatives for a large majority of matrices, and that the resulting performance is explained by a combination of data locality and load balancing concerns.
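As a small illustration of the kind of comparison the study performs, the snippet below reorders a synthetic sparse matrix with one classic baseline (reverse Cuthill-McKee via SciPy) and times SpMV before and after; the matrix and timings are illustrative, not the paper's 490-matrix suite or its six reordering algorithms.

```python
# Illustrative SciPy sketch of reordering a sparse matrix with reverse
# Cuthill-McKee (one classic baseline) and comparing SpMV time before and
# after; the random matrix and timings are synthetic, not the paper's suite.
import time
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(20000, 20000, density=1e-4, format="csr", random_state=0)
A = (A + A.T).tocsr()                       # symmetrize so RCM is well-defined
x = np.ones(A.shape[0])

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm].tocsr()            # apply the permutation to rows/cols

for name, M in [("original", A), ("RCM-reordered", A_rcm)]:
    t0 = time.perf_counter()
    for _ in range(50):
        M @ x
    print(f"{name}: {time.perf_counter() - t0:.3f} s for 50 SpMVs")
```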
Invited Talk
Education
HPC in Society
TP
DescriptionAchievements in high-performance computing (HPC) ─ including computational and data-enabled science, analytics, learning, and artificial intelligence (AI) ─ drive progress in science and technology throughout our world. For example, collaborators in the U.S. Department of Energy (DOE) Exascale Computing Project (ECP) are pushing advances across a compelling range of scientific and engineering disciplines by pioneering a robust ecosystem of software technologies that exploit cutting-edge exascale computer architectures.
In order for the HPC community to address the most urgent scientific and societal challenges of the 21st century, the HPC workforce must embody a wide range of skills and perspectives … fully reflecting the diversity of society, including traditionally underrepresented communities — Black or African American, Hispanic/Latinx, Native American, Alaska Native, Native Hawaiian, Pacific Islanders, women, persons with disabilities, and first-generation scholars.
Each of us can make important contributions to broadening participation in HPC. This presentation will provide an overview of a variety of workforce efforts throughout the HPC community and opportunities for involvement. We will discuss the contributions of DOE lab staff who are working as part of the ECP Broadening Participation Initiative to address DOE workforce challenges through a lens that considers the distinct needs and culture of high-performance computing. Activities focus on three complementary thrusts: (1) Establishing an HPC Workforce Development and Retention Action Group to foster a supportive and inclusive culture in DOE labs and communities; (2) expanding the Sustainable Research Pathways (SRP) internship and workforce development program as a multi-lab cohort of students from underrepresented groups (and faculty working with them), who collaborate with DOE lab staff on world-class R&D projects; and (3) creating the Intro to HPC Bootcamp, an immersive program designed to engage students in energy justice using project-based pedagogy and real-life science stories to teach foundational skills in HPC, scalable AI, and analytics while exposing students to the excitement of DOE mission-driven team science. The presentation will highlight the first bootcamp (a collaboration among staff from advanced computing facilities at Argonne, Lawrence Berkeley, and Oak Ridge National Labs, Sustainable Horizons Institute, the DOE Office of Economic Impact and Diversity, and academic partners), which took place in August 2023 and featured a variety of HPC energy justice projects inspired by the DOE Justice40 Initiative. We will also consider challenges and opportunities for future work to broaden participation in HPC.
Exhibits
Flash Session
TP
XO/EX
DescriptionJoin speakers from NVIDIA and Arc Compute as they discuss solutions to the everyday challenges organizations face when building AI infrastructure and learn how Arc Compute's turnkey, end-to-end AI solutions, powered by NVIDIA GPUs and networking, are game changers helping decision-makers design, procure, and deploy their AI infrastructure.
Workshop
Data Movement and Memory
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionWe propose a new framework called CachedArrays and a set of APIs to address the data tiering problem in large-scale heterogeneous and disaggregated memory systems. The proposed framework operates at a variable-size object granularity and allows the programmer to specify semantic hints about the future use of data via a Policy API. These hints are used by a Data Manager to choose when and where to place a particular data object using a data management API, bridging the semantic gap between the programmer and the platform-specific hardware details and optimizing overall performance. We evaluate the proposed framework on a real hardware platform with terabytes of memory consisting of NVRAM and DRAM, on large-scale ML training workloads such as CNNs, DNNs, and DLRM that exhibit different data access and usage patterns.
Paper
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionThis paper presents a parameterized analytical performance model of transformer-based Large Language Models (LLMs) for guiding high-level algorithm-architecture codesign studies. This model derives from an extensive survey of performance optimizations that have been proposed for the training and inference of LLMs; the model's parameters capture application characteristics, the hardware system, and the space of implementation strategies. With such a model, we can systematically explore a joint space of hardware and software configurations to identify optimal system designs under given constraints, like the total amount of system memory. We implemented this model and methodology in a Python-based open-source tool called Calculon. Using it, we identified novel system designs that look significantly different from current inference and training systems, showing quantitatively the estimated potential to achieve higher efficiency, lower cost, and better scalability.
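To give a feel for what such an analytical model computes, the following back-of-the-envelope sketch estimates a roofline-style lower bound on the time of one transformer layer from FLOPs and weight traffic; the constants are illustrative assumptions and not Calculon's actual equations.

```python
# Back-of-the-envelope, hypothetical sketch of an analytical estimate for one
# transformer layer: time is bounded below by both compute and memory traffic.
# The constants are illustrative and are not Calculon's actual equations.
def layer_time_s(d_model, seq_len, batch, peak_flops, mem_bw_bytes, bytes_per_elem=2):
    # Rough FLOP count for the big GEMMs: attention (4*d^2) + MLP (8*d^2), forward only.
    flops = 2 * batch * seq_len * (4 * d_model ** 2 + 8 * d_model ** 2)
    # Rough weight traffic: the same 12*d^2 parameters read once from memory.
    weight_bytes = 12 * d_model ** 2 * bytes_per_elem
    t_compute = flops / peak_flops
    t_memory = weight_bytes / mem_bw_bytes
    return max(t_compute, t_memory)            # roofline-style lower bound

# Example: d=8192, seq=2048, batch=1 on a ~1 PFLOP/s, ~3 TB/s accelerator.
print(f"{layer_time_s(8192, 2048, 1, 1e15, 3e12) * 1e3:.2f} ms per layer")
```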
Workshop
W
DescriptionThe ongoing revolution enabled via containerization, virtualization, and new orchestration models has dramatically changed how applications and services are delivered and managed across the computing industry. This revolution has established a new ecosystem of tools and techniques with new, flexible and agile approaches, and continues to gain traction in the HPC community. In addition to HPC-optimized container runtimes, emerging technologies like Kubernetes create a new set of opportunities and challenges. While adoption is growing, questions regarding best practices, foundational concepts, tools, and standards remain. Our goal is to promote the adoption of these tools and introspect the impact of this new ecosystem on HPC use cases. This workshop serves as a key venue for presenting late-breaking research, sharing experiences and best practices, and fostering collaboration in this field. Our fifth workshop iteration will continue to emphasize real-world experiences and challenges in adopting and optimizing these new approaches for HPC.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionExtending Linux through kernel modules offers immense potential benefits and capabilities for HPC. Deployment is also more likely since Linux is typically the only supported vendor OS. However, because Linux is monolithic, kernel modules are free to access any address with maximum permissions. A poorly written, or untrustworthy, module can wreak havoc. This makes it hard to justify including custom kernel modules in production HPC systems. We address this limitation using the previously developed compiler- and runtime-based address translation (CARAT) model and toolchain, which injects guards around memory accesses. The accesses are then allowed/disallowed according to a policy. We share our results regarding the guard injection and address validation process. Our CARAT-based Kernel Object Protection (CARAT KOP) prototype is able to transform a substantial production kernel module from the kernel tree (a NIC driver comprising ~19,000 lines of code). The transformed module runs with minimal effect on its performance.
Panel
Energy Efficiency
Green Computing
Sustainability
TP
DescriptionWhat does it mean for computer systems to be sustainable? We have made significant improvements to operational efficiency in HPC systems. We now need to consider a broader scope of environmental impacts across the life cycle of our systems: how they are designed and manufactured, how they are transported, how they are operated, and how we tear them down, re-use, and recycle them after they are no longer useful. These considerations may not be obvious. For example, manufacturing costs dominate the life cycle carbon footprint of systems, and that trend is on the rise. How can we start to consider the carbon footprint across the end-to-end life cycle of our systems? We have many capabilities for understanding the performance, power, and energy of our systems, but the same cannot be said for carbon footprint. Should carbon footprint be a first-order optimization target?
Early Career Program
Inclusivity
TP
DescriptionFinding the right career path early may be one of the most rewarding discoveries in a young professional's life. This panel discussion will feature insightful stories and kernels of wisdom from four panelists whose diverse careers span start-ups to large companies, non-profit organizations to universities, and government labs to government agencies. They offer their practical wisdom to present a broader picture of the different workplaces in the HPC community, helping young individuals better match their strengths and objectives to the challenges and rewards of each workplace.
Workshop
Programming Frameworks and System Software
W
DescriptionWe present a new methodology and tool that speeds up the process of optimizing science and engineering programs. The tool, called CaRV (Capture, Replay, and Validate), enables users to experiment quickly with large applications, comparing individual program sections before and after optimizations in terms of efficiency and accuracy. Using language-level checkpointing techniques, CaRV captures the necessary data for replaying the experimental section as a separate execution unit after the code optimization and validating the optimization against the original program. The tool reduces the amount of time and resources spent on experimentation with long-running programs by up to two orders of magnitude, making program optimization more efficient and cost-effective.