Students@SC
TP
W
TUT
XO/EX
Description: This workshop will explore the definitions of microaggressions, macroaggressions, and microaffirmations, and effective methods for recognizing their impacts in the workplace. The workshop will consist of understanding and defining biases and reviewing subtle remarks that may seem commonplace but can be harmful. The objective is for participants to gain an understanding of what microaggressions are, how harmful they can be, and how to combat them, including with microaffirmations, to promote a positive culture.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Description: The complexity of node architectures in supercomputers increases as we cross milestones on the way toward exascale and beyond. Increasing levels of parallelism in multi- and many-core chips and emerging heterogeneity of computational resources coupled with energy and memory constraints force a reevaluation of our approaches towards operating systems and runtime environments.
The International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) provides a forum for researchers to exchange ideas and discuss research questions that are relevant to upcoming supercomputers and cloud environments for high-performance computing. In addition to typical workshop publications, we encourage novel and possibly immature ideas, provided that they are interesting and on-topic. Well-argued position papers are also welcome.
Workshop
Education
State of the Practice
W
Description: This paper describes an assignment in the Chapel programming language for creating a 1D heat equation solver. Two methods are used to solve the problem, exposing a variety of parallel programming concepts. The first portion of the assignment uses high-level parallel constructs, namely Chapel's forall loop and Block distribution, to create a simple distributed-memory solver. Here, students are asked to think about what it means for an array to be split across the memory in multiple compute nodes while relying on the language to handle the details of communication and synchronization. The second portion of the assignment uses low-level parallelism, like barriers and explicit communication. Here, the goal is to reduce overhead, while introducing students to the ideas of explicit communication and synchronization. In both parts, students are provided with a non-distributed version of the solver and are asked to create a modified version that runs across multiple compute nodes.
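The assignment itself is written in Chapel; as a rough illustration of the computation being distributed, the short Python/NumPy sketch below shows the explicit finite-difference update at the heart of a 1D heat equation solver (grid size, diffusion factor, and step count are illustrative assumptions, not values from the assignment).

```python
import numpy as np

# Illustrative parameters, not taken from the assignment.
n, steps, alpha = 1_000, 5_000, 0.25   # grid points, time steps, diffusion factor

u = np.zeros(n)
u[0], u[-1] = 1.0, 1.0                 # fixed boundary temperatures

for _ in range(steps):
    # Explicit stencil: each interior point relaxes toward its neighbors.
    # In the Chapel assignment, this is the loop that a `forall` over a
    # Block-distributed array spreads across compute nodes.
    u[1:-1] += alpha * (u[:-2] - 2.0 * u[1:-1] + u[2:])

print(u[:5])
```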
Birds of a Feather
Performance Measurement, Modeling, and Tools
TP
XO/EX
Description: Data-intensive supercomputer applications are increasingly important workloads, especially for "Big Data" problems, but are ill suited for most of today's computing platforms (at any scale!). The Graph500 list has grown to over 357 entries and has demonstrated the challenges of even simple analytics. The new SSSP kernel introduced at SC17 has increased the benchmark's overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and enhance the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
Paper
Exascale
Large Scale Systems
State of the Practice
TP
Description: HPL-MxP is an emerging high-performance benchmark used to measure the mixed-precision computing capability of leading supercomputers. This work presents our effort on the new Sunway supercomputer that linearly scales the benchmark to over 40 million cores, sustains an overall mixed-precision performance exceeding 5 ExaFlop/s, and achieves over 85% of peak performance, the highest efficiency among all heterogeneous systems on the HPL-MxP list. The optimizations in our HPL-MxP implementation include: (1) a Two-Direction Look-Ahead and Overlap algorithm that overlaps all communication with computation; (2) a multi-level process-mapping and communication-scheduling method that uses the network as effectively as possible while maintaining a conflict-free algorithm flow; and (3) a CG-Fusion computing framework that eliminates up to 60% of inter-chip communication and removes the memory-access bottleneck while serving both computation and communication simultaneously. This work can also provide useful insights for tuning cutting-edge applications on Sunway supercomputers as well as other heterogeneous supercomputers.
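The general idea behind HPL-MxP (independent of the Sunway-specific optimizations above) is to perform the O(n^3) factorization in low precision and recover double-precision accuracy with a cheap iterative refinement loop. A minimal NumPy/SciPy sketch of that idea follows, with illustrative sizes and iteration counts; it is not the paper's implementation.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
b = rng.standard_normal(n)

# Expensive O(n^3) step done in single precision.
lu, piv = lu_factor(A.astype(np.float32))

# Initial low-precision solve, then refine the residual in double precision.
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
for _ in range(5):
    r = b - A @ x                                   # FP64 residual
    x += lu_solve((lu, piv), r.astype(np.float32))  # cheap correction solve

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```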
Paper
Accelerators
Applications
Modeling and Simulation
TP
Description: A highly scalable and fully optimized earthquake model is presented, based on the latest Sunway supercomputer. Contributions include:
1) the curvilinear grid finite-difference method (CGFDM) and a flexible model applying a perfectly matched layer (PML), enabling more accurate and realistic terrain descriptions;
2) a hybrid and non-uniform domain decomposition scheme that efficiently maps the model across different levels of the computing system; and
3) sophisticated optimizations that largely alleviate or even eliminate bottlenecks in memory, communication, etc., obtaining a speedup of over 140x.
Combining all innovations, the design fully exploits the hardware potential of all aspects and enables us to perform the largest CGFDM-based earthquake simulation ever reported (69.7 PFlops using over 39 million cores).
Based on our design, the Turkey earthquakes (February 6, 2023) and the Ridgecrest earthquake (July 4, 2019) are successfully simulated with a maximum resolution of 12 m. Precise hazard evaluations for hazard reduction in earthquake-stricken areas are also conducted.
Exhibits
Flash Session
TP
XO/EX
Description: This session will discuss the latest generation of Nokia's PSE (Photonic Switch Engine), which provides up to 1.2 Tb/s per wavelength and helps close the gap to Shannon's limit.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: The complexity and parameter counts of mainstream large models are increasing rapidly. For example, the increasingly popular large language models (e.g., ChatGPT) have billions of parameters. While this has led to performance improvements, for simple tasks the gains may not justify the additional cost. We apply residual networks of three different depths and evaluate them extensively on the MedMNIST pneumonia dataset. Experimental results show that smaller models can achieve satisfactory performance at significantly lower cost than larger models.
Workshop
Artificial Intelligence/Machine Learning
W
Description: The field of optimal control under partial differential equation (PDE) constraints is rapidly changing under the influence of deep learning and the accompanying automatic differentiation libraries. Novel techniques like Physics-Informed Neural Networks (PINNs) and Differentiable Programming (DP) are to be contrasted with established numerical schemes like Direct-Adjoint Looping (DAL). We present a comprehensive comparison of DAL, PINN, and DP using a general-purpose, mesh-free differentiable PDE solver based on radial basis functions. On the Laplace and Navier-Stokes equations, we find DP to be extremely effective, as it produces the most accurate gradients, thriving even when DAL fails and PINNs struggle. Additionally, we provide a detailed benchmark highlighting the limited conditions under which any of these methods can be used efficiently. Our work provides a guide for optimal control practitioners and connects them further to the deep learning community.
Birds of a Feather
Quantum Computing
TP
XO/EX
Description: Integrating quantum computing (QC) test beds into scientific computing environments presents challenges in software interfaces and system familiarity. High-performance computing (HPC) centers are taking on this task, but selecting suitable test bed technologies is complex due to the number of providers with varying maturity levels and the associated risk of single-vendor systems.
A component-based approach is promising but faces challenges from the lack of standardized benchmarks and the need for device-specific calibrations. This discussion addresses the challenges of component-based approaches and explores unifying access to diverse QC technologies, leveraging HPC for optimization, and fulfilling researcher needs.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Description: This paper presents an adaptive continuum synchronization method for data science pipelines deployed on edge-fog-cloud infrastructures. In a diagnostic phase, a model based on the Bernoulli principle is used as an analogy to create a global representation of bottlenecks in a pipeline. In a supervision phase, a watchman/sentinel cooperative system monitors and captures the throughput of the pipeline stages to create a bottleneck-stage scheme. In a rectification phase, this system produces replicas of stages identified as bottlenecks to mitigate workload congestion using implicit parallelism and load-balancing algorithms. This method is invoked automatically and transparently to produce a steady continuum dataflow at runtime. To test our proposal, we conducted a case study on the processing of medical and satellite data on fog-cloud infrastructures. The evaluation revealed that, without characterizing workloads or knowing infrastructure details, this method creates continuum dataflows whose performance is competitive with state-of-the-art solutions.
Exhibitor Forum
Exascale
Programming Frameworks and System Software
Quantum Computing
TP
XO/EX
Description: Take a deep dive into the latest developments in NVIDIA software for high performance computing applications, including a comprehensive look at what’s new in programming models, compilers, libraries, and tools. We'll cover topics of interest to HPC developers, targeting traditional HPC modeling and simulation, quantum computing, HPC+AI, scientific visualization, and high-performance data analytics.
Workshop
Programming Frameworks and System Software
W
Description: Insights about applications and user environments can help HPC center staff make data-driven decisions about cluster operations. In this paper, we present a fast and responsive web-based visualization framework for analyzing HPC application usage. By leveraging XALT, a powerful tool for tracking application and library usage, we collected tens of millions of data points on a national supercomputer. The portable visualization framework, created with Plotly Dash, can be easily launched as a container and accessed from a web browser. The presented visualizations take a deep dive into the XALT data, analyzing application, compiler, and library usage, and even user-specific usage. Our analysis codes can distinguish between centrally installed applications and user-installed applications and can generate plots based on different metrics (number of jobs or CPU-hours). Initial insights gained from this visualization framework have helped our support staff identify several goals for improving the software stack and proactively helping users.
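For readers unfamiliar with Plotly Dash, a self-contained sketch of the kind of containerized, browser-accessible dashboard described here is shown below; the application names and job counts are made up, whereas the actual framework reads aggregated XALT records.

```python
import plotly.express as px
from dash import Dash, dcc, html

# Hypothetical aggregated XALT-style records: application name vs. job count.
usage = {"app": ["gromacs", "vasp", "lammps", "wrf"], "jobs": [1200, 950, 640, 310]}

fig = px.bar(usage, x="app", y="jobs", title="Jobs per application (example data)")

app = Dash(__name__)
app.layout = html.Div([
    html.H2("HPC application usage"),
    dcc.Graph(figure=fig),      # the real framework adds dropdowns and callbacks
])

if __name__ == "__main__":
    app.run(debug=True)         # serves the dashboard in a web browser
```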
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
Description: In this work, we explore how to replicate the behavior of undocumented hardware units -- in this case, NVIDIA's Tensor Cores -- and reason about them.
While prior work has employed manual testing to identify hardware behavior, we show that SMT can be used to generate inputs that can discriminate between different hardware implementation choices. We argue that SMTLIB, the language specification for SMT solvers, is well suited for exposing hardware implementations.
Using our method, we create a formal specification of the tensor cores on NVIDIA's Volta architecture. We confirm many of the findings of previous studies on tensor cores, but also identify two discrepancies: we find that the hardware does not use IEEE-754 round-to-zero for accumulation and that the 5-term accumulator requires 3 extra bits for carry out since it does not normalize intermediate sums.
The work will be presented in person using the poster as a visual aid.
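To give a flavor of the approach (this is a toy, not the poster's actual Tensor Core specification), the Z3 sketch below uses the SMT floating-point theory to ask for inputs on which two candidate accumulator behaviors, round-to-nearest-even versus round-toward-zero, produce different results; such inputs can then be run on the real hardware to discriminate between the candidates.

```python
from z3 import FP, Float32, Solver, fpAdd, fpIsNormal, RNE, RTZ, sat

x, y = FP("x", Float32()), FP("y", Float32())

# Two candidate accumulator behaviors: round-to-nearest-even vs. round-toward-zero.
sum_rne = fpAdd(RNE(), x, y)
sum_rtz = fpAdd(RTZ(), x, y)

s = Solver()
s.add(fpIsNormal(x), fpIsNormal(y))   # keep the witness away from NaN/inf/subnormals
s.add(sum_rne != sum_rtz)             # ask for inputs where the two behaviors differ

if s.check() == sat:
    m = s.model()
    print("discriminating inputs:", m[x], m[y])
```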
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
Description: Modern tasking models define applications in a fine-grained manner that necessitates lower overhead per segment of computation. While previous work has implemented hardware support for tasking models, many implementations lack the support required by heterogeneity and fall short of expanding memory interfaces for data-centric needs and memory utilization. In this paper, we propose and implement a hardware support scheme for the sequential codelet model (SCM). The hardware support makes it possible to demonstrate SCM's potential advantage on heterogeneous workloads and its capability to support the expanding software memory interface. The gem5 implementation of the sequential codelet model serves as a foundation to demonstrate the benefits offered by the SCM program execution model by moving hardware support closer to program semantics. We compare the overhead with DARTS, a software implementation of the codelet model that has been shown to be useful for fine-grained execution, and show a 20x reduction in overhead.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
Description: Automated computational steering automatically guides simulations toward productive states by combining data analysis with predefined control-flow paths. Interactive computational steering achieves a similar goal, but relies on manual human intervention instead. Existing in situ libraries can fulfill some computational steering use cases, but not all of them. This paper presents a general-purpose interface for instrumenting existing simulation codes with interactive computational steering capabilities. Common use cases are presented, summarized from informal interviews with seven research scientists who use large-scale simulations in their work. Preliminary support for bidirectional communication via simulation callbacks and shell commands has been implemented in Ascent, a software library that provides simulations with in situ analysis and visualization infrastructure. Finally, a proof-of-concept instrumentation is provided, demonstrating that the proposed interface is sufficiently flexible to enable any interactive computational steering use case within Ascent-instrumented simulations.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
Description: Detecting strongly connected components (SCCs) is an important step in various graph computations. The fastest GPU and CPU implementations from the literature work well on graphs where most of the vertices belong to a single SCC and the vertex degrees follow a power-law distribution. However, these algorithms can be slow on the mesh graphs used in certain radiative transfer simulations, which have a nearly constant vertex degree and can have significant variability in the number and size of SCCs. We introduce ECL-SCC, an SCC detection algorithm that addresses these shortcomings. Our approach is GPU-friendly and employs innovative techniques such as maximum ID propagation and edge removal. On an A100 GPU, ECL-SCC performs on par with the fastest prior GPU code on power-law graphs and outperforms it by 7.8x on mesh graphs. Moreover, ECL-SCC running on the GPU outperforms fast parallel CPU code by three orders of magnitude on meshes.
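ECL-SCC's maximum-ID-propagation and edge-removal techniques are GPU-specific and not reproduced here; as background, the classic forward-backward decomposition that parallel SCC detectors build on can be sketched in a few lines of serial Python (toy graph, recursion kept for clarity).

```python
from collections import defaultdict

def scc_fw_bw(vertices, edges):
    """Toy forward-backward SCC decomposition (serial, recursive)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for u, v in edges:
        fwd[u].add(v)
        bwd[v].add(u)

    def reach(start, adj, allowed):
        # Vertices reachable from `start` using only vertices in `allowed`.
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in seen or u not in allowed:
                continue
            seen.add(u)
            stack.extend(adj[u])
        return seen

    def recurse(remaining):
        if not remaining:
            return []
        pivot = next(iter(remaining))
        f, b = reach(pivot, fwd, remaining), reach(pivot, bwd, remaining)
        scc = f & b                                   # the pivot's SCC
        rest = [f - scc, b - scc, remaining - f - b]  # three independent subproblems
        return [scc] + [c for part in rest for c in recurse(part)]

    return recurse(set(vertices))

print(scc_fw_bw([1, 2, 3, 4], [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]))
```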
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: The field of in silico cellular modeling has made notable strides in the number of cells that can be simultaneously modeled. While computational capabilities have grown exponentially, I/O performance has lagged behind. To address this issue, we present an in-transit approach to enable in situ visualization and analysis of large-scale fluid-structure-interaction models on leadership-class systems. We describe the proposed framework and demonstrate the feasibility of this approach by measuring the overhead it introduces. The proposed framework provides a valuable tool both for at-scale debugging and for enabling scientific discovery that would be difficult to achieve otherwise.
Posters
Research Posters
TP
XO/EX
Description: In traditional deep learning workflows, AI applications (producers) train DNN models offline using fixed datasets, while inference serving systems (consumers) load the trained models to serve real-time inference queries. In practice, AI applications often operate in a dynamic environment where data is constantly changing. Compared to offline learning, continuous learning frequently (re)trains models to adapt to the ever-changing data. This demands regular deployment of the DNN models, increasing the model update frequency between producers and consumers. Typically, producers and consumers are connected via model repositories such as a parallel file system (PFS), which may result in high model update latency due to the I/O bottleneck of the PFS. To address this, our work introduces a high-performance I/O framework that speeds up model updates between producers and consumers. It employs a cache-aware model handler to minimize latency and an intelligent performance predictor to maintain a balance between training and inference performance.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
Description: Finding a minimum spanning tree (MST) is a fundamental graph algorithm with applications in many fields. This paper presents ECL-MST, a fast MST implementation designed specifically for GPUs. ECL-MST is based on a parallelization approach that unifies Kruskal's and Borůvka's algorithm and incorporates new and existing optimizations from the literature, including implicit path compression and edge-centric operation. On two test systems, it outperforms leading GPU and CPU codes from the literature on all of our 17 input graphs from various domains. On a Titan V GPU, ECL-MST is, on average, 4.6 times faster than the next fastest code, and on an RTX 3080 Ti GPU, it is 4.5 times faster. On both systems, ECL-MST running on the GPU is roughly 30 times faster than the fastest parallel CPU code.
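ECL-MST itself is a GPU code; as background, the Borůvka half of the unified Kruskal/Borůvka formulation is easy to sketch serially in Python. The union-find below uses simple path halving, which is only loosely related to the "implicit path compression" optimization the paper incorporates.

```python
def boruvka_mst(n, edges):
    """Serial Borůvka sketch: edges are (weight, u, v); returns MST edge list."""
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    while len(mst) < n - 1:
        # For every component, remember its cheapest outgoing edge.
        best = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in best or w < best[r][0]:
                        best[r] = (w, u, v)
        if not best:                  # graph is disconnected
            break
        for w, u, v in best.values():
            ru, rv = find(u), find(v)
            if ru != rv:              # components may already be merged this round
                parent[ru] = rv
                mst.append((u, v, w))
    return mst

print(boruvka_mst(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]))
```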
Posters
Research Posters
TP
XO/EX
Description: Numerical simulations require solving linear systems with large sparse matrices that have high condition numbers. LDU factorization with a pivoting strategy provides a robust solver for such systems. The computational complexity of the factorization is high and cannot be reduced within the framework of a direct solver, but by using lower-precision arithmetic, the computational cost and memory usage can be reduced. LDU factorization uses recursive generation of the Schur complement matrix, but generation of the last one can be replaced by an iterative method. Here, decomposition of the whole matrix into a union of moderate and hard parts during factorization with threshold pivoting plays a key role. A new algorithm uses factorization in lower precision as a preconditioner for an iterative solver in higher precision to generate the last Schur complement. True mixed-precision arithmetic is used in the forward/backward substitution for the preconditioner, with the factorized matrix in lower precision and the RHS vectors in higher precision.
Posters
Scientific Visualization & Data Analytics Showcase
Data Analysis, Visualization, and Storage
Modeling and Simulation
Visualization
TP
XO/EX
Description: The Advanced Visualization Lab at NCSA created a cinematic scientific visualization showing a flight through the Milky Way galaxy to the galactic center, where stars orbit a supermassive black hole. The tour summarizes results from Andrea Ghez's Galactic Center Group: their study of the motions of stars around the Milky Way's central black hole reveals a rich and surprising environment, with hot young stars (coded as purple) where few were expected to be, many orbiting in a common plane; a paucity of cooler old stars (yellow); a population of unexpected "G-object" dusty stars (red); and an eclipsing binary star (teal). The black hole itself, shrouded in mystery, is seen only as a tiny, faint, twinkling radio source. But the movement of these nearby stars, especially the S0-2 "hero" (pale blue ellipse), probes the black hole's gravity, exposing its massive presence.
Posters
Research Posters
TP
XO/EX
Description: Identifying genetic mutations is pivotal to enabling clinicians to prescribe personalized therapies to their patients. The Genome Analysis Toolkit's HaplotypeCaller, which relies on the Pair Hidden Markov Model (PairHMM) algorithm, is one of the most widely used applications to identify such variants. However, the PairHMM is the bottleneck of this tool. Deploying the algorithm on hardware accelerators is a valuable solution. Nevertheless, state-of-the-art designs lack the flexibility to support the length variability of the input sequences and are not usable in real-life application scenarios. For these reasons, this work presents a GPU accelerator for the PairHMM capable of supporting sequences of any length, thanks to a dynamic memory-swap methodology, overcoming the limitations of existing solutions. Our accelerator achieves an 8154× speedup over the software baseline, surpassing the best-performing state-of-the-art design by up to 1.6×.
Birds of a Feather
Cloud Computing
Distributed Computing
TP
XO/EX
Description: We are building a National Science Data Fabric (NSDF) that introduces a novel trans-disciplinary approach for integrated data delivery and access to shared storage, networking, computing, and educational resources. Such a data fabric can democratize data-driven scientific discovery across the growing data science community. In this BoF, we want to engage the data science community to discuss the challenges and opportunities of the NSDF project and other similar efforts to connect an open network of institutions, including resource-disadvantaged institutions, and develop a federated testbed configurable for individual and shared scientific use.
Workshop
Algorithms
Applications
Architecture and Networks
W
Description: Vector processors have become essential to high-performance computing in scientific and engineering applications, especially in numerical calculations that leverage data parallelism. With escalating computational demands, the efficient execution of Sparse GEneral Matrix-Matrix Multiplication (SpGEMM) on vector processors has become crucial. However, SpGEMM brings challenges for vector processors due to its complex data structures and irregular memory access patterns.
We present a new method designed to perform SpGEMM on vector processors, inspired by Iterative Row Merging. The proposed method hierarchically merges rows by utilizing long vector instructions. We evaluate the proposed method against other methods across 27 sparse matrices. The results indicate that the proposed method outperforms other methods for 22 out of the 27 sparse matrices, reaching up to 31.9 times better performance in the best case. Furthermore, we compare with the GPU implementation that inspired our proposed method, using the same generation of GPUs.
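The row-merging formulation being vectorized is essentially Gustavson's row-wise SpGEMM: each output row of C = A·B is built by merging the rows of B selected and scaled by the nonzeros of the corresponding row of A. The dict-of-dicts Python sketch below shows that merge in its simplest serial form; the hierarchical merging with long vector instructions proposed in the paper is not captured here.

```python
def spgemm_row_merge(A, B):
    """A and B are sparse matrices as {row: {col: value}}; returns A @ B in the same form."""
    C = {}
    for i, a_row in A.items():
        acc = {}                          # merged (accumulated) output row
        for k, a_ik in a_row.items():     # scale row k of B by A[i,k] and merge it in
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm_row_merge(A, B))   # {0: {1: 16.0}, 1: {0: 15.0}}
```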
Workshop
Algorithms
Applications
Architecture and Networks
W
Description: In dynamic networks, where continuous topological changes are prevalent, it becomes paramount to find and update various graph properties without the computational burden of recalculating them from scratch. However, finding or updating a multi-objective shortest path (MOSP) in such a network is challenging, as it involves simultaneously optimizing multiple (conflicting) objectives.
In light of this, we focus on shortest-path search and propose parallel algorithms tailored specifically for large incremental graphs. We first present an efficient algorithm that updates the single-objective shortest path (SOSP) whenever a new set of edges is introduced. Leveraging this SOSP update algorithm, we also devise a novel heuristic approach to adaptively update a MOSP in large networks. Empirical evaluations of shared-memory implementations on both real and synthetic incremental networks attest to the scalability and efficacy of the proposed algorithms.
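The SOSP-update building block can be illustrated with a small serial sketch: after inserting new edges, only vertices whose tentative distance improves need to be re-relaxed, rather than rerunning Dijkstra from scratch. The parallel formulation and the MOSP heuristic are the paper's contributions and are not shown; the graph below is a toy.

```python
import heapq

def update_sssp(dist, adj, new_edges):
    """Incrementally repair shortest-path distances after inserting new_edges.

    dist: current distance dict; adj: {u: [(v, w), ...]}, extended in place.
    """
    heap = []
    for u, v, w in new_edges:
        adj.setdefault(u, []).append((v, w))
        if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
            dist[v] = dist[u] + w
            heapq.heappush(heap, (dist[v], v))
    # Propagate improvements only from affected vertices.
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {0: [(1, 4)], 1: [(2, 4)]}
dist = {0: 0, 1: 4, 2: 8}
print(update_sssp(dist, adj, [(0, 2, 3)]))   # {0: 0, 1: 4, 2: 3}
```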
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Description: We present a simple performance model to estimate the qubit-count and runtime associated with large-scale error-corrected quantum computations. Our estimates extrapolate current usage costs of quantum computers and show that computing the ground state of the 2D Hubbard model, which is widely believed to be an early candidate for practical quantum advantage, could start at a million dollars. Our model shows a clear cost advantage of up to four orders of magnitude for quantum processors based on superconducting technology compared to ion trap devices. Our analysis shows that usage costs, while substantial, will not necessarily block the road to practical quantum advantage. Furthermore, the combined effects of algorithmic improvements, more efficient error correction codes, and R&D cost amortization are likely to lead to orders of magnitude reductions in cost.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Description: The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches.
To this end, this work evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
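The performance portability figure of 0.96 presumably refers to the widely used harmonic-mean metric of Pennycook et al. (the abstract does not restate it); for an application a, problem p, and platform set H, with efficiency e_i(a,p) achieved on platform i, that metric is

```latex
\mathrm{PP}(a,p,H) =
\begin{cases}
  \dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a,p)}} & \text{if } a \text{ is supported on every platform } i \in H,\\[1.5ex]
  0 & \text{otherwise,}
\end{cases}
```

so a value of 0.96 would mean the harmonic mean of CRK-HACC's per-platform efficiencies across the AMD, Intel, and NVIDIA GPUs is 96%.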
Posters
Research Posters
TP
XO/EX
Description: A software tool called SPEL has been developed to port and optimize the ultrahigh-resolution ELM (uELM) code for GPUs within a functional unit test framework. To promote the widespread adoption of this approach for community-based uELM development, this poster presents a portable software environment that enables efficient development of the uELM code on GPUs. The standalone software environment, which uses Docker, contains all the code, libraries, and system software required for uELM development using SPEL. The process includes identifying a Docker image that supports GPUs, configuring and simulating ELM at the site level, capturing reference solutions, testing uELM functional units, and generating and optimizing GPU-compatible code. The effectiveness of this methodology is demonstrated through a case study.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
Description: Memory disaggregation has recently been adopted in major data centers to improve resource utilization, driven by cost and sustainability. Meanwhile, studies on large-scale HPC facilities have also highlighted memory under-utilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system in three levels, moving from general, to multi-tier memory, and then to memory pooling. We also provide tools to facilitate the quantitative approach. We evaluated a set of representative HPC workloads on an emulated platform. Our results show that interference in memory pooling has varied application impact, depending on access ratio and arithmetic intensity. Finally, our method is applied in two case studies to show benefits at both the application and system level.
Keynote
TP
W
TUT
XO/EX
Description: Dr. Hakeem Oluseyi grew up in some of the roughest neighborhoods in the country. As a result, he spent a lot of time inside, reading encyclopedias and watching PBS nature shows. At a young age, he discovered a love of science and space that was inspired by his role model, Albert Einstein. Throughout his childhood and into young adulthood, he was repeatedly faced with circumstances that would make most people give up—a lack of supervision at home, attending his state’s lowest rated school, falling in with the wrong crowd, and failing physics exams when he ultimately made his way to Stanford. But Hakeem never gave up.
Today, as a world-renowned astrophysicist and the former Space Science Education Lead at NASA, Hakeem inspires audiences around the world to chase impossible dreams, fight for what they want, refuse to listen to naysayers, and reach out and lend a hand up to those around them. Hilarious, honest, and inspiring, Hakeem wows audiences with a look at his mind-bending scientific research while motivating them with his personal life story.
Workshop
Quantum Computing
Software Engineering
W
Description: Practical applications of quantum computing are currently limited by the number of qubits that can be prepared with reasonable fidelities in each system. Therefore, a distributed quantum computing system in which multiple quantum computers are coherently connected is in high demand. To realize inter-node communication of quantum information, a software interface, the Quantum Message Passing Interface (QMPI), was proposed; it leverages the framework built for classical MPI but takes advantage of quantum teleportation to communicate between different quantum nodes. In this project, we develop QMPI with point-to-point and collective operations in Qiskit and characterize its performance through application implementations. Moreover, we develop a new technique for optimizing collective communication in distributed quantum programs with multi-controlled Toffoli gates. This technique beats the state of the art in terms of fidelity and the number of remote EPR pairs consumed, in both simulations and experiments.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Description: HPC systems employ a scheduling technique called "backfilling", wherein low-priority jobs are scheduled earlier to use resources that would otherwise sit idle waiting for pending high-priority jobs. Backfilling relies on job runtimes to calculate the start times of ready-to-schedule jobs and avoid delaying them. It is a common belief that better estimations of job runtime will lead to better backfilling and more effective scheduling. However, our experiments reach a different conclusion: there is an overlooked trade-off between prediction accuracy and backfilling opportunities. To learn how to achieve the best trade-off, we believe reinforcement learning (RL) can be effectively leveraged. Based on this idea, we designed RLBackfilling, a reinforcement-learning-based backfilling algorithm. Our evaluation shows up to 17x better scheduling performance compared to EASY backfilling using user-provided job runtimes and 4.7x better performance compared with EASY using ideally predicted job runtimes (the actual job runtimes).
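For context, the fixed rule that RLBackfilling replaces with a learned policy is the EASY admission test: a queued job may jump ahead only if it fits in the currently idle nodes and its runtime estimate says it will finish before the reservation of the highest-priority pending job. A simplified Python sketch is shown below (the job fields are made up, and the "extra nodes" case of EASY is omitted).

```python
def can_backfill(candidate, free_nodes, head_reservation_time, now):
    """Simplified EASY check: run `candidate` now only if it fits in the currently
    free nodes and, per its runtime estimate, finishes before the highest-priority
    pending job's reserved start time (the 'extra nodes' case is omitted)."""
    fits_now = candidate["nodes"] <= free_nodes
    ends_in_time = now + candidate["estimated_runtime"] <= head_reservation_time
    return fits_now and ends_in_time

job = {"nodes": 8, "estimated_runtime": 3600}            # hypothetical queued job
print(can_backfill(job, free_nodes=16, head_reservation_time=7200, now=1800))  # True
```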
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: High-performance computing (HPC) systems are essential for various scientific fields, and effective job scheduling is crucial for their performance. Traditional backfilling techniques, such as EASY backfilling, rely on user-submitted runtime estimates, which can be inaccurate and lead to suboptimal scheduling. This poster presents RL-Backfiller, a novel reinforcement learning (RL) based approach to improving HPC job scheduling. Our method uses RL to make better backfilling decisions, independent of user-submitted runtime estimates. We trained RL-Backfiller on the synthetic Lublin-256 workload and tested it on the real SDSC-SP2 1998 workload. We show how RL-Backfiller can learn effective backfilling strategies via trial-and-error on existing job traces and outperform traditional EASY backfilling and other heuristic combinations. Our evaluation results show up to 17x better scheduling performance (based on average bounded job slowdown) compared to EASY backfilling.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Description: Distributed scientific applications run on a complex stack of software and network technologies. Each layer has configuration options for tuning performance, ranging from protocol thresholds to algorithmic changes for collectives. Micro-benchmarks are a common methodology for evaluating the communication stack and are relatively easy to tune; however, they are not representative of application behavior. Proxy applications, by contrast, offer a simplified but realistic representation of the main computational and communication methods in scientific programs. Since proxy applications contain realistic message-passing patterns, the correlation between micro-benchmark and proxy-application performance is not obvious. We present a study that statistically analyzes the impacts of tuning. Our results show how tuned micro-benchmark performance correlates with tuned proxy-application performance.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Description: An entire ecosystem of methodologies and tools revolves around scientific workflow management. They cover crucial non-functional requirements that standard workflow models fail to target, such as interactive execution, energy efficiency, performance portability, Big Data management, and intelligent orchestration in the Computing Continuum. Characterizing and monitoring this ecosystem is crucial to develop an informed view of current and future research directions. This work conducts a systematic mapping study of the Italian workflow research community, collecting and analyzing 25 tools and 10 applications from several scientific domains in the context of the "National Research Centre for HPC, Big Data, and Quantum Computing" (ICSC). The study aims to outline the main current research directions and determine how they address the critical needs of modern scientific applications. The findings highlight a variegated research ecosystem of tools, with a prominent interest in advanced workflow orchestration and still immature but promising efforts toward energy efficiency.
Workshop
Data Analysis, Visualization, and Storage
Data Movement and Memory
W
Posters
Research Posters
TP
XO/EX
Description: Triangle counting is a cornerstone operation in large-graph analytics. It has historically been a challenging problem, owing to the irregular and dynamic nature of the algorithm, which not only inhibits compile-time optimizations but also requires runtime optimizations such as message aggregation and load-imbalance mitigation. Popular triangle counting algorithms are either inherently slow, fail to take advantage of the vectorization available in modern processors, or involve sparse matrix operations. With its support for fine-grained asynchronous messages, the Partitioned Global Address Space (PGAS) model combined with the Actor model has been identified as efficient for irregular applications. However, few triangle counting implementations have been implemented optimally on top of PGAS Actor runtimes. To address these challenges, we propose a set-intersection-based implementation of a distributed triangle counting algorithm on top of a PGAS Actor runtime. Evaluation of our approach on the PACE Phoenix cluster and the Perlmutter supercomputer shows encouraging results.
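The set-intersection formulation itself is compact: orient every edge toward its higher-numbered endpoint and intersect the "forward" neighbor sets of the two endpoints, so each triangle is counted exactly once. The serial Python sketch below shows only this kernel; the PGAS/Actor distribution, message aggregation, and load balancing are the poster's contributions.

```python
from itertools import combinations

def count_triangles(edges):
    """Count triangles by intersecting 'forward' neighbor sets (neighbors with a
    larger vertex ID), so each triangle is counted exactly once."""
    fwd = {}
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        fwd.setdefault(a, set()).add(b)
        fwd.setdefault(b, set())
    return sum(len(fwd[a] & fwd[b]) for a in fwd for b in fwd[a])

# A 4-clique on vertices 0..3 contains exactly 4 triangles.
edges = list(combinations(range(4), 2))
print(count_triangles(edges))   # 4
```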
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
Description: GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion.
In this poster, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce), can outperform NCCL as well as Cray MPI by up to 4.5X and 20.2X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
Description: Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute---such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing (HPC), and edge systems, but passing data among computational steps via cloud storage can incur high costs. Here, we overcome this obstacle with a new programming paradigm that decouples control flow from data flow by extending the pass-by-reference model to distributed applications. We describe ProxyStore, a system that implements this paradigm by providing object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
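The pass-by-reference idea can be illustrated with a generic lazy proxy in plain Python. This is a conceptual sketch only, not ProxyStore's actual API: the consumer holds a lightweight reference, and the underlying object is fetched from the store the first time it is used.

```python
class Proxy:
    """Generic lazy proxy: the wrapped object is fetched only on first use."""

    def __init__(self, fetch):
        self._fetch = fetch      # zero-argument callable that retrieves the object
        self._target = None

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself.
        if self._target is None:
            self._target = self._fetch()
        return getattr(self._target, name)

# A stand-in for a remote object store keyed by a reference string.
store = {"result-123": "hello from the data producer"}

ref = Proxy(lambda: store["result-123"])  # cheap to create and pass around
print(ref.upper())                        # the object is resolved here, on first use
```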
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: In recent years, we have seen unprecedented growth of data in our daily lives, ranging from health data from an Apple Watch, financial stock-price data, and volatile cryptocurrency data, to diagnostic data from nuclear/rocket simulations. The increase in high-precision, high-sample-rate time-series data is a challenge for existing database technologies. We have developed a novel technique that utilizes sparse-file support to achieve O(1) time complexity in create, read, update, and delete (CRUD) operations while supporting time granularity down to one second. We designed and implemented XStore to be lightweight and offer high performance without the need to maintain an index of the time-series data. We conducted a detailed evaluation of XStore against existing best-of-breed systems such as MongoDB using synthetic data spanning 20 years at second granularity, totaling over 5 billion data points. Through empirical experiments against MongoDB, XStore achieves 2.5X better latency and delivers up to 3X higher throughput.
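The O(1) CRUD claim rests on computing a record's file offset directly from its timestamp and letting the filesystem's sparse-file support leave unwritten ranges as holes, so no index is needed. The Python sketch below illustrates that addressing scheme with struct-packed float64 samples at one-second granularity; the layout and constants are assumptions for illustration, not XStore's actual on-disk format.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<d")        # one float64 sample per second
EPOCH = 1_600_000_000               # series start time (illustrative)

def offset(ts):
    return (ts - EPOCH) * RECORD.size   # O(1): offset computed from the timestamp

def write_sample(f, ts, value):
    f.seek(offset(ts))                  # seeking past EOF creates a sparse hole
    f.write(RECORD.pack(value))

def read_sample(f, ts):
    f.seek(offset(ts))
    data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else None

path = os.path.join(tempfile.mkdtemp(), "series.dat")
with open(path, "w+b") as f:
    write_sample(f, EPOCH + 86_400 * 365, 42.0)   # a sample one year in: no index needed
    print(read_sample(f, EPOCH + 86_400 * 365))   # 42.0
```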
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Description: HPC not only performs complex calculations at high speed but also processes large amounts of data. HPC systems separate compute nodes and storage nodes to process them effectively: all computation is performed on compute nodes, and all data is stored on storage nodes. To perform data analytics, compute nodes have to read large amounts of data from storage nodes because simulation output is large. Compute nodes must have enough memory to hold extremely large data sets, and bandwidth from storage can become a bottleneck as well. However, the data actually required for analytics is only a small part of the total.

One solution to this problem is computational storage. Since computational storage processes data where it resides and transfers only results to compute nodes, it can reduce data movement and increase performance. SK hynix is researching computational storage technologies with Los Alamos National Laboratory. We propose Object-based Computational Storage (OCS) as a new computational storage platform for data analytics in HPC. OCS offers not only high scalability but also data-aware characteristics, which enable OCS to perform analytics independently, without help from compute nodes. We intend to leverage the Apache analytics ecosystem, including Arrow and Substrait, to enhance that ecosystem with the advantages that computing near storage enables. Systems that use Arrow can transfer query results using a common transfer format, and Substrait provides a standard, open representation of query plans, enabling pushdown of query portions to computational storage.

SK hynix's key technology for OCS is the Object-based Computational Storage Array (OCSA), used as backend storage. With OCSA, OCS will provide flexible query pushdown and analytics acceleration as well as lower software overhead. This talk will introduce the OCS architecture and discuss why we propose OCS as a future direction for computational storage in HPC.
Exhibits
Flash Session
TP
XO/EX
Description: Join us as we delve into real-world customer experiences with data ingestion and discover how a high-performance, highly scalable SMB server accelerates and eases the process. Learn how SMB can help you address complex data ingestion challenges, and how Fusion File Share by Tuxera enhances the efficiency of the process.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Description: Earthquake early warning systems use synthetic data from simulation frameworks like MudPy to train models for predicting the magnitudes of large earthquakes. MudPy, although powerful, has limitations: a lengthy simulation time to generate the required data, a lack of user-friendliness, and no platform for discovering and sharing its data. We introduce the FakeQuakes DAGMan Workflow (FDW), which utilizes the Open Science Grid (OSG) for parallel computations to accelerate and streamline MudPy simulations. FDW significantly reduces runtime and increases throughput compared to a single-machine setup. Using FDW, we also explore partitioned parallel HTCondor DAGMan workflows to enhance OSG efficiency. Additionally, we investigate leveraging cyberinfrastructure, such as the Virtual Data Collaboratory (VDC), to enhance MudPy and OSG. Specifically, we simulate using cloud-bursting policies to enforce FDW job offloading to VDC during OSG peak demand, addressing shared-resource issues and user goals; we also discuss VDC's value in facilitating a platform for broad access to MudPy products.
Workshop
Algorithms
Applications
Architecture and Networks
W
Description: Deep Neural Network guided Monte Carlo Tree Search (DNN-MCTS) is a powerful class of AI algorithms. The DNN operations are highly parallelizable, but tree-search operations are sequential and often become the system bottleneck. Existing parallel MCTS schemes on CPU platforms either exploit data parallelism but sacrifice memory-access latency, or take advantage of local caches for low-latency accesses but constrain the search to a single thread. This work analyzes the trade-offs of these parallel schemes and proposes an adaptive parallel scheme that optimally chooses the parallelization of the MCTS component on the CPU. Additionally, we propose an efficient method for searching for the optimal communication batch size when the CPU interfaces with DNN operations on an accelerator (GPU). Using a DNN-MCTS algorithm on board-game benchmarks, we show that our approach adaptively generates the best-performing parallel implementation, yielding speedups of 1.5-3x over the baseline methods.
Exhibits
Flash Session
TP
XO/EX
Description: In the modern business landscape, AI-driven initiatives are inhibited by an overly complicated data management ecosystem. Organizations are struggling to integrate various databases, distributed object stores, filesystems, and divergent data migration techniques. Learn about DDN's next generation approach to resolve the complexities of diverse infrastructures and unlock AI-driven digital transformation.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
Description: Hyperparameter optimization (HPO) of neural networks is a computationally expensive procedure that has the potential to benefit from novel accelerator capabilities. This paper investigates the performance of three popular HPO algorithms, in terms of the achieved speed-up and model accuracy, utilizing early-stopping, Bayesian, and genetic optimization approaches in combination with mixed-precision functionality on NVIDIA A100 GPUs with Tensor Cores. The benchmarks are performed on 64 GPUs in parallel on three datasets: two from the vision domain and one from the CFD domain. The results show that, depending on the algorithm, larger speed-ups can be achieved with mixed precision than with full-precision HPO if the checkpoint frequency is kept low. In addition to the reduced runtime, small gains in generalization performance on the test set are also observed.
Workshop
Applications
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
Large Scale Systems
Middleware and System Software
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Description: The relatively slow data transfer speeds that cause I/O bottlenecks in scientific simulations are one of the critical challenges in exascale computing. Simulations generate large data volumes, and analysis applications consume this data to provide time-critical insights. The limited capacity and high power consumption of Dynamic Random Access Memory (DRAM) leave slow storage devices as the primary option for large-scale data transfers. Non-volatile memory (NVM) devices such as Intel Optane bridge the gap between storage and volatile memory by providing DRAM-comparable performance and persistence. We present PQueue, a data transfer library for in situ analysis of simulation output using persistent memory. PQueue leverages NVM and provides an API that resembles high-level parallel I/O libraries such as PnetCDF, enabling a seamless transition for application developers. We achieved up to a 7X improvement in write times and up to a 10X improvement in read times compared to PnetCDF.
Workshop
Artificial Intelligence/Machine Learning
W
Description: We leverage physics-embedded differentiable graph network simulators (GNS) to accelerate particulate and fluid simulations to solve forward and inverse problems. GNS represents the domain as a graph with particles as nodes and learned interactions as edges, improving generalization to new environments. GNS achieves over 165x speedup for granular flow prediction compared to parallel CPU simulations. We propose a novel hybrid GNS/Material Point Method to accelerate forward simulations by minimizing error on a surrogate model, achieving 24x speedup. The differentiable GNS enables solving inverse problems through automatic differentiation, identifying material parameters that result in target runout distances. We demonstrate solving inverse problems by iteratively updating the friction angle by computing the gradient of a loss function based on the final and target runouts, thereby identifying the friction angle that matches the observed runout. The physics-embedded and differentiable simulators open an exciting paradigm for AI-accelerated design, control, and optimization.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Description: NVIDIA Grace Hopper Superchips are a scale-up architecture ideal for scientific computing workflows involving CPUs and GPUs. Building on a decade of GPU acceleration, Grace Hopper realizes NVIDIA NVLink C2C, a 900 GB/s interconnect between the Grace CPU and the Hopper H100 GPU. C2C enables coherent memory at 7x the bandwidth of PCIe across Hopper's 96 GB of HBM3 and Grace's up to 480 GB of LPDDR5X. This removes the conceptual CPU/GPU memory divide and lowers barriers for scientists accelerating their applications with ever-faster GPUs, e.g., the H100 delivering up to 67 FP64 teraflops and 4 TB/s of memory bandwidth.

With more application code executing on GPUs, workload performance becomes increasingly susceptible to non-GPU limiters like data movement and CPU performance (Amdahl's Law). C2C and the Grace CPU, ideal for single-thread or multi-core CPU workloads, restore the required balance. Grace combines 72 Arm Neoverse V2 cores with the NVIDIA Scalable Coherency Fabric, a distributed cache and mesh fabric with 3.2 TB/s of bisection bandwidth. This high-bandwidth mesh enables one NUMA node for all 72 CPU cores, simplifying multi-core programming. Each core implements a 512-bit SVE2 SIMD pipeline for a total CPU FP64 theoretical peak of 7.1 teraflops. Combined with the up to 500 GB/s memory bandwidth of the LPDDR5X DRAM, Grace delivers twice the performance-per-watt of conventional x86-64 CPUs.

This session presents HPC and AI workload performance results with a technical deep dive into the specific features of Grace Hopper that accelerate each workload. We discuss how Grace Hopper's distinctive coupling of CPU/GPU hardware and the accompanying software stack create a platform that increases developer productivity, accelerates existing applications, and facilitates new standard programming models in C++, Fortran, and Python. Attendees will gain a deeper understanding of how to extract the performance offered by Grace Hopper and realize the potential of this innovative, energy-efficient platform for science and industry.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionStorage I/O is becoming more of a bottleneck, especially for a new generation of AI-based workloads that are accelerated by GPUs. This session will provide a brief overview of key trends, available solutions presented as lightning talks, and illustrative application performance gains in this space. The majority of the session will engage in an open, forward-looking discussion with the gathered community on promising areas for investigation. Presenters will include those from academia and industry with new and challenging applications, storage partners with characterization expertise, and innovators with new solutions in GPU-initiated storage and greater security. Join us for an exciting exchange!
Workshop
State of the Practice
W
DescriptionThe existing HPC I/O stack struggles with the growing demands of HPC scientific workloads. The latency bottleneck begins with a deeply layered kernel hierarchy that translates HPC I/O requests into actual storage operations; this layered architecture adds significant overhead along the entire I/O request path. Measurements have shown that it takes between 18,000 and 20,000 instructions to send and receive a single fundamental 4KB I/O request. Our novel hardware/software framework, named DeLiBA, aims to bridge this gap by facilitating the development of software components within the HPC I/O stack in user space, rather than in kernel space, and leverages a proven 16 nanometer (nm) FPGA framework to quickly deploy FPGA-based HPC I/O accelerators. Our initial results achieve a 10% increase in throughput and demonstrate up to 2.3 times the I/O operations per second compared to conventional methods.
Exhibits
Flash Session
TP
XO/EX
DescriptionIn the world of high-performance computing, networks serve as vital conduits for data transmission, security, and application delivery. However, the ever-evolving nature of network traffic demands adaptive solutions. This presentation describes the challenges in optimizing and evolving network application performance and the unique role FPGAs can play in accelerating next-generation HPC networks.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionHeterogeneous Intellectual Property (IP) hardware acceleration engines have emerged as a viable path forward to improving performance as Moore’s Law and Dennard scaling wane. In this study, we design, prototype, and evaluate the HPC-specialized ZHW floating-point compression accelerator as a resource on a System on Chip (SoC). Our full hardware/software implementation and evaluation reveal inefficiencies at the system level that significantly throttle the potential speedup of the ZHW accelerator. By optimizing data movement between the CPU, memory, and accelerator, a 6.9X speedup is possible compared to a RISC-V64 core, and 2.9X over a Mac M1 ARM core.
Birds of a Feather
Distributed Computing
State of the Practice
TP
XO/EX
DescriptionThe ACCESS Resource Providers (RPs) will give an overview of the available resources and their unique characteristics. Those resources are open to a broad audience of computational researchers. Individuals can apply for allocations by submitting a request to ACCESS. Once this request is approved, they can exchange their awarded service units for resources at one or several of the providers (e.g., node hours, GPU hours, storage).
The presentations will highlight the variety of resources and will be followed by a discussion with the community, allowing the audience to directly interact with the RPs.
Visit https://app.meet.ps/attendee/fcqctplo to submit questions beforehand.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionThe accurate and efficient determination of hydrologic connectivity has garnered significant attention from both academic and industrial sectors due to its critical implications for environmental management. While recent studies have leveraged the spatial characteristics of hydrologic features, the use of elevation models for identifying drainage paths can be influenced by flow barriers. To address these challenges, our focus in this study is on detecting drainage crossings through the application of advanced convolutional neural networks (CNNs). In pursuit of this goal, we use neural architecture search to automatically explore CNN models for identifying drainage crossings. Our approach not only attains high accuracy (over 97% for average precision) in object detection but also excels in efficiently inferring correct drainage crossings within a remarkably short time frame (0.268 ms). Furthermore, we perform a detailed profiling of our approach on GPU systems to analyze performance bottlenecks.
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionSustainability in HPC is a major challenge not only for HPC centers and their users, but also for society. A lot of effort went into reducing the energy consumption of systems, but most efforts propose solutions targeting CPUs. As HPC systems shift more to GPU-centric architectures, simulation codes increasingly adopt GPU-programming models, leading to an urgent need to increase the energy-efficiency of GPU-enabled codes. However, studies for reducing the energy consumption of large-scale simulations executing on CPUs and GPUs have received insufficient attention.
In this work, we enable accurate energy measurements using an open-source toolkit across CPU+GPU architectures. We use this approach in SPH-EXA, an open-source GPU-centric astrophysical and cosmological simulation framework showing that with code instrumentation, users can accurately measure energy consumption of their application, beyond the data provided by HPC systems. The accurate energy data provide significant insights to users for conducting energy-aware computational experiments and code development.
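As a rough sketch of what code-level energy instrumentation can look like (not the toolkit used in this work), the snippet below wraps a region of interest and reads the Intel RAPL package-energy counter that Linux exposes under /sys/class/powercap; GPU energy would be read analogously, e.g., through NVML:

```python
import time
from contextlib import contextmanager
from pathlib import Path

RAPL = Path("/sys/class/powercap/intel-rapl:0/energy_uj")  # package-0 energy counter

def read_energy_uj():
    return int(RAPL.read_text())

@contextmanager
def energy_region(label):
    """Measure wall time and CPU package energy around an instrumented region."""
    e0, t0 = read_energy_uj(), time.perf_counter()
    yield
    e1, t1 = read_energy_uj(), time.perf_counter()
    joules = (e1 - e0) / 1e6  # counter is in microjoules; wrap-around ignored here
    print(f"{label}: {joules:.2f} J over {t1 - t0:.2f} s ({joules / (t1 - t0):.1f} W avg)")

# Usage (requires read access to the RAPL sysfs file):
# with energy_region("force kernel"):
#     run_timestep()
```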
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionPerformance variability in complex computer systems is a major challenge for accurate benchmarking and performance characterization, especially for tightly-coupled large-scale high-performance computing systems. Point summaries of performance may be both uninformative, if they do not capture the full richness of its behavior, and inaccurate, if they are derived from an inadequate sample set of measurements. Determining the correct sample size requires balancing tradeoffs of computation, methodology, and statistical power.
We treat the performance distribution as the primary target of the performance evaluation, from which all other metrics can be derived. We propose and evaluate a meta-heuristic that dynamically characterizes the performance distribution, determining when enough samples have been collected to approximate the true distribution. Compared to fixed stopping criteria, this adaptive method can be more efficient in resource use and more accurate. Importantly, it requires no advance assumptions about the system under test or its performance characteristics.
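The paper's meta-heuristic is its own; as one plausible sketch of the idea, the snippet below keeps drawing benchmark samples and stops once the empirical distribution appears stable, here measured by the Kolmogorov-Smirnov distance between the first and second halves of the samples, with batch size and threshold as illustrative assumptions:

```python
import numpy as np

def ks_distance(a, b):
    """Maximum gap between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def sample_until_stable(measure, batch=20, eps=0.05, max_samples=2000):
    """Collect performance samples until the distribution stops changing."""
    samples = [measure() for _ in range(2 * batch)]
    while len(samples) < max_samples:
        half = len(samples) // 2
        if ks_distance(np.array(samples[:half]), np.array(samples[half:])) < eps:
            break
        samples.extend(measure() for _ in range(batch))
    return np.array(samples)

# Toy "benchmark": a noisy, occasionally perturbed run time in seconds.
rng = np.random.default_rng(0)
measure = lambda: rng.normal(1.0, 0.05) + (rng.random() < 0.1) * rng.exponential(0.2)
runs = sample_until_stable(measure)
print(f"stopped after {len(runs)} runs, median {np.median(runs):.3f} s")
```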
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
DescriptionGlobal ocean data assimilation is a crucial technique to estimate the actual oceanic state by combining numerical model outcomes and observation data, which is widely used in climate research. Due to the imbalanced distribution of observation data in the global ocean, the parallel efficiency of recent methods suffers from workload imbalance. When massive GPUs are applied for global ocean data assimilation, the workload imbalance becomes more severe, resulting in poor scalability. In this work, we propose a novel adaptive workload-balance scheduling strategy for assimilation, which successfully estimates the total workload prior to execution and ensures a balanced workload assignment. Further, we design a parallel dynamic programming approach to accelerate the schedule decision, and develop a factored dataflow to exploit the parallel potential of GPUs. Evaluation demonstrates that our algorithm outperforms the state-of-the-art method by up to 9.1x speedup. This work is the first to scale global ocean data assimilation to 4,000 GPUs.
Workshop
Applications
Software Engineering
W
DescriptionData race detection tools should find data races not only in development builds of applications, but also in optimized production builds. One architecture-dependent optimization is vectorization of the code. At the moment, DataRaceBench does not contain microkernels that test for data races in vectorized code. The few codes with SIMD directives are too simple, so compilers tend to refuse to vectorize the loops. We carefully created new microkernels, with and without data races, that a tool will only detect if vector instructions are considered in the analysis. The new microkernels cover different vectorized memory access instructions. We used the new microkernels to verify the support for vectorized memory accesses in Intel Inspector and LLVM ThreadSanitizer.
While Intel Inspector could detect all data races in the new microkernels, ThreadSanitizer could not find the data races when the code is vectorized.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionThis lightning talk will highlight how several aspects of sustainability can frame the programming themes for a senior-level parallel computing class. We used the shallow water equation as a theme in our assignments from serial C & MPI, through OpenMP/PThreads, to CUDA. By framing the problem sets in the setting of sustainability, both in terms of power usage/performance and in terms of motivating the problem we are solving (the shallow water equation) through its sustainability/environmental impact, our goal is to help the students get genuinely excited about our field and “spread the word”.
The inspiration for the core idea of this work came from attending the CDER 2022 PDC training workshop. It led to an ongoing related miniproject sponsored by Norway's national Excited Centre of Excellent IT Education.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms and Frameworks
W
DescriptionAdvancements in reinforcement learning (RL) via deep neural networks have enabled their application to a variety of real-world problems. However, these applications often suffer from long training times. While attempts to distribute training have been successful in controlled scenarios, they face challenges in heterogeneous-capacity, unstable, and privacy critical environments. This work applies concepts from federated learning (FL) to distributed RL, specifically addressing the stale gradient problem. A deterministic framework for asynchronous federated RL is utilized to explore dynamic methods for handling stale gradient updates in the Arcade Learning Environment. Experimental results from applying these methods to two Atari-2600 games demonstrate a relative speedup of up to 95% compared to plain A3C in large and unstable federations.
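The specific staleness-handling methods evaluated in the paper are its own; as a generic sketch of the underlying idea, a parameter server can down-weight gradient updates by how many global steps old they are before applying them, a common way of taming stale gradients in asynchronous federations (the decay schedule below is an assumption):

```python
import numpy as np

class AsyncServer:
    """Global parameter server that discounts stale worker gradients."""

    def __init__(self, dim, lr=0.1):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.version = 0  # incremented on every applied update

    def apply(self, grad, worker_version):
        staleness = self.version - worker_version
        weight = 1.0 / (1.0 + staleness)     # older gradients count for less
        self.theta -= self.lr * weight * grad
        self.version += 1
        return self.version

# Toy usage: two workers pushing gradients computed against different versions.
server = AsyncServer(dim=4)
server.apply(np.ones(4), worker_version=0)   # fresh update, full weight
server.apply(np.ones(4), worker_version=0)   # one step stale, weight 1/2
print(server.theta, "version", server.version)
```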
Tutorial
Data Analysis, Visualization, and Storage
I/O and File Systems
Large Scale Systems
Performance Measurement, Modeling, and Tools
TUT
DescriptionAs concurrency and complexity continue to increase on high-end machines, storage I/O performance is rapidly becoming a fundamental challenge to scientific discovery. At the exascale, online analysis will become a dominant form of data analytics, and thus scalable in situ workflows will become critical, along with high performance I/O to storage. The many components of a workflow running simultaneously pose another challenge of evaluating and improving the performance of these workflows. Therefore, performance data collection needs to be an integral part of the entire workflow.
In this tutorial, we present ADIOS-2, which allows for building in situ and file-based data processing workflows for extreme-scale systems, including interactive, on-demand, in situ visualization of the data, and performance profiling of the entire workflow. Half of this tutorial will be hands-on sessions, where we provide access to the software and together build a complete MiniApp with in situ analytics and performance analysis that users can run on their laptops and at large scale on supercomputers. We will show how ADIOS-2 is fully integrated into three popular visualization and performance tools: Jupyter Notebook, ParaView, and TAU, creating a software ecosystem for in situ processing of both performance and scientific data.
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionSZ is a lossy floating-point data compressor that excels in compression ratio and throughput for high-performance computing (HPC), time series databases, and deep learning applications. However, SZ performs poorly for small chunks and has slow decompression. We pinpoint the Huffman tree in the quantization factor encoder as the bottleneck of SZ. In this paper, we propose ADT-FSE, a new quantization factor encoder for SZ. Based on the Gaussian distribution of quantization factors, we design an adaptive data transcoding (ADT) scheme to map quantization factors to codes for better compressibility, and then use finite state entropy (FSE) to compress the codes. Experiments show that ADT-FSE improves the quantization factor compression ratio, compression and decompression throughput by up to 5x, 2x and 8x, respectively, over the original SZ Huffman encoder. On average, SZ_ADT is over 2x faster than ZFP in decompression.
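ADT-FSE's actual transcoding tables and FSE backend are not reproduced here; the sketch below only illustrates the first half of the idea under stated assumptions: quantization factors clustered around a central value are remapped with a zig-zag transform so that frequent symbols receive small codes, after which a generic entropy coder (zlib standing in for FSE) compresses the code stream:

```python
import zlib
import numpy as np

def zigzag(q, center):
    """Map factors near `center` to small non-negative codes: 0, 1, 2, 3, ..."""
    d = q.astype(np.int64) - center
    return np.where(d >= 0, 2 * d, -2 * d - 1).astype(np.uint16)

rng = np.random.default_rng(1)
# Synthetic quantization factors with a Gaussian-like spread around 512.
quant = np.clip(rng.normal(512, 3, 1_000_000).round(), 0, 1023).astype(np.uint16)

center = int(np.bincount(quant).argmax())            # adapt the mapping to the data
packed = zlib.compress(zigzag(quant, center).tobytes(), 9)  # zlib stands in for FSE
print(f"{quant.nbytes} raw bytes -> {len(packed)} bytes "
      f"(ratio {quant.nbytes / len(packed):.1f}x)")
```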
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionTestbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers for the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements, while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.
Tutorial
Algorithms
Message Passing
Performance Optimization
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
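The tutorial's own examples are written against C/MPI; as a rough Python sketch of one listed topic, the snippet below performs a 2D stencil halo exchange over a Cartesian process topology with mpi4py (local sizes and the periodic layout are illustrative assumptions):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)            # factor ranks into a 2D grid
cart = comm.Create_cart(dims, periods=[True, True])    # periodic Cartesian topology

n = 8
u = np.zeros((n + 2, n + 2))                           # local interior plus one-cell halo
u[1:-1, 1:-1] = cart.Get_rank()

# Row halos are contiguous: Shift returns (source, destination) along a dimension.
src, dst = cart.Shift(0, 1)
cart.Sendrecv(np.ascontiguousarray(u[-2, 1:-1]), dest=dst, recvbuf=u[0, 1:-1], source=src)
cart.Sendrecv(np.ascontiguousarray(u[1, 1:-1]), dest=src, recvbuf=u[-1, 1:-1], source=dst)

# Column halos are strided, so stage them through contiguous scratch buffers
# (a C version would typically use an MPI vector datatype here instead).
src, dst = cart.Shift(1, 1)
recv = np.empty(n)
cart.Sendrecv(np.ascontiguousarray(u[1:-1, -2]), dest=dst, recvbuf=recv, source=src)
u[1:-1, 0] = recv
cart.Sendrecv(np.ascontiguousarray(u[1:-1, 1]), dest=src, recvbuf=recv, source=dst)
u[1:-1, -1] = recv

# One Jacobi-style update using the freshly exchanged halos.
u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
```

Run it with, e.g., mpirun -n 4 python halo2d.py (the file name is arbitrary).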
Tutorial
Accelerators
Heterogeneous Computing
Performance Optimization
TUT
DescriptionWith the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. All topics are accompanied by extensive case studies, and we discuss the corresponding language features in-depth. Continuing the emphasis of this successful tutorial series, we focus solely on performance programming for multi-core architectures. Throughout all topics, we present the recent additions of OpenMP 5.2 and comment on developments targeting OpenMP 6.0.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionFPGAs have gone from niche components to being a central part of many data centers worldwide. The last year has seen tremendous advances in FPGA programmability and technology, especially in the shift to reconfigurable architectures that are heterogeneous and/or based on CGRAs or other AI engines. This BoF has two parts. The first is a series of lightning talks presenting advances in tools, technologies, and use-cases for these emerging architectures. The second part of the BoF will be a general discussion driven by the interests of the attendees, potentially including additional topics.
Workshop
Data Analysis, Visualization, and Storage
Data Movement and Memory
W
DescriptionReal-world HPC workloads, including simulations and machine learning, place significant strain on storage infrastructure due to their data dependency, exacerbated by the diverse storage options in modern HPC environments, leading to I/O bottlenecks. To mitigate these bottlenecks, past analysis methods relied on manual evaluations and tools like Darshan for I/O trace collection, often necessitating expert involvement and substantial time commitments. Given the time-intensive nature of manual analysis and the pressing need to mitigate I/O bottlenecks effectively, automated analysis tools were developed. According to our findings, these tools, while providing automation, can still benefit from transitioning from heuristics-based approaches to more data-driven decision-making. To address this, we propose a data-driven approach that leverages multi-perspective views.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionAs supercomputers become larger, with powerful Graphics Processing Units (GPUs), traditional direct eigensolvers struggle to keep up with the hardware evolution and to scale efficiently due to communication and synchronization demands. Subspace eigensolvers, like the Chebyshev Accelerated Subspace Eigensolver (ChASE), have a simpler structure and can overcome communication and synchronization bottlenecks. ChASE is a modern subspace eigensolver that uses Chebyshev polynomials to accelerate the computation of extremal eigenpairs of dense Hermitian eigenproblems. In this work we show how we have modified ChASE by rethinking its memory layout, introducing a novel parallelization scheme, switching to a more performant communication-avoiding algorithm for one of its inner modules, and substituting the MPI library with the vendor-optimized NCCL library. The resulting library can tackle dense problems with size up to N=O(10^6), and scales effortlessly up to the full 900 nodes---each one powered by 4xA100 NVIDIA GPUs---of the JUWELS Booster hosted at the Jülich Supercomputing Centre.
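ChASE itself is a distributed, GPU-resident library; the numpy sketch below only illustrates the numerical idea it builds on, under simplifying assumptions (a synthetic Hermitian matrix with a known spectral gap, crude filter bounds, and a fixed polynomial degree): a Chebyshev filter amplifies the lowest eigenpairs relative to the rest of the spectrum, and a Rayleigh-Ritz step extracts them from the filtered block.

```python
import numpy as np

def cheb_filter(H, X, deg, a, b):
    """Degree-`deg` Chebyshev filter that damps eigenvalues inside [a, b]
    and amplifies those below a (the wanted, extremal part of the spectrum)."""
    c, e = (a + b) / 2.0, (b - a) / 2.0
    Y_prev, Y = X, (H @ X - c * X) / e
    for _ in range(2, deg + 1):
        Y_prev, Y = Y, 2.0 / e * (H @ Y - c * Y) - Y_prev
    return Y

rng = np.random.default_rng(0)
n, k = 300, 6
vals = np.concatenate([np.linspace(0.0, 1.0, k), np.linspace(5.0, 100.0, n - k)])
V = np.linalg.qr(rng.standard_normal((n, n)))[0]
H = V @ np.diag(vals) @ V.T                          # Hermitian test matrix, known spectrum

a, b = 5.0, 100.0                                    # interval holding the unwanted spectrum
X = np.linalg.qr(rng.standard_normal((n, k)))[0]     # random starting block

for _ in range(3):                                   # filter + Rayleigh-Ritz sweeps
    Q = np.linalg.qr(cheb_filter(H, X, deg=15, a=a, b=b))[0]
    theta, S = np.linalg.eigh(Q.T @ H @ Q)           # Rayleigh-Ritz on the filtered block
    X = Q @ S

print("Ritz values :", np.round(theta, 6))
print("exact lowest:", np.round(np.sort(vals)[:k], 6))
```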
Birds of a Feather
Applications
TP
XO/EX
DescriptionAgriculture worldwide is facing massive challenges in production, distribution, pollution reduction, and food security and waste: less than 40% of any crop is actually marketed. The farm, the oldest human-engineered system, produces the vast majority of human sustenance and consumes the majority of global freshwater. Its efficient operation is of vital importance, particularly when supply chains are disrupted by wars and pandemics. This BoF will discuss how novel supercomputing technologies and related distributed heterogeneous systems at scale could empower the primary sector so that it no longer operates in a needlessly fragile and inefficient way.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionGenerative AI is quickly becoming mainstream and everyone wants a slice of the AI pie. Led by an AI and HPC expert from Penguin Solutions, this exhibitor forum will explore what it takes to deliver AI to the masses – from cost to management of running AI architectures. The speaker will discuss options available for companies to scale their AI infrastructure, including renting AI factories in the cloud with a pay-as-you-go model versus building an AI factory of your own.
Two of the most important questions without a doubt have to be around cost and management of running AI architectures. Audience members will come away from this forum with real-world insights that they can apply directly to whatever their current AI setup is. This technical deep dive will also go in-depth on the tools you can implement, like Penguin Computing TrueHPC that can be used with AI solutions to easily build complex, high-performance environments across the many facets of your IT infrastructure.
Want to learn about the pros and cons of building an AI factory in the cloud and using a pay-as-you-go model? Or are you more interested in buying or building your very own AI factory? How do cost and performance factor into all of this? This forum will answer all of those questions and more, and leave audience members with actionable takeaways that have the power to positively impact current AI operations. We’re throwing away the notion that you have to be an established enterprise with deep pockets to run AI models and empower the supercomputing community.
Workshop
Fault Handling and Tolerance
W
DescriptionThe community spent a dozen years developing production-level checkpointing techniques, such as VeloC, capable of capturing and saving extreme volumes of data with negligible overhead for exascale scientific parallel application executions. A novel category of systems will emerge within the next ten years: Integrated Research Infrastructures (IRIs). These infrastructures will connect supercomputers, scientific instrument facilities, large-scale data repositories, and collections of edge devices to form nationwide execution environments that users will share to run scientific workflows. The characteristics of IRIs and the workflow execution constraints raise a new set of unexplored research questions regarding resilience, especially execution state management. In this talk, we will first review the projected characteristics of IRIs and the user constraints regarding workflow executions. One projected IRI characteristic is the practical difficulty (probable impossibility) of capturing consistent states of the full system: resilience mechanisms will likely need to work only with an approximate system view. To address this unique resilience design characteristic, the DOE-funded SWARM project will explore a novel resilience approach based on AI-augmented distributed agents, where each node of the IRI runs an agent with a view of the system limited to its neighbors. We will review the open research questions raised by this revolutionary approach (fault types, fault detection, fault notification, execution state capture, and management) and some potential directions to address them.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionRecent advances in artificial intelligence methods show the enormous potential of AI methods. The underlying concept is the use of embedding spaces to represent real-world information. These embedding spaces have been used to represent, transform, and work with complex information in large-language models but also in many other domains such as climate sciences or automated driving systems. In this talk, we focus on embedding spaces for programs and use them primarily to assess, analyze, and improve program performance. We start by deriving a first embedding from the textual LLVM intermediate representation (IR) and show that it successfully predicts GPU execution times of programs. We then show that textual representations bear the danger of missing context and of being overly sensitive to specific strings. Using a graph-based representation, we improve the embedding to capture relationships such as data dependencies and flows in LLVM IR. Finally, we discuss DaCe's performance metaprogramming capabilities and its programmable graph-based IR. We then demonstrate how a graph-neural-network (GNN)-based embedding can capture general performance properties. Those properties form the concept of Performance Embeddings for Transfer Tuning and can be used to select optimization metaprograms to apply to transform the IR graph.
Workshop
Applications
State of the Practice
W
DescriptionCancer is complex, with contributing factors distributed across the entire genome affecting every aspect of the disease. But typical artificial intelligence and machine learning (AI/ML) would require 3B-patient training sets to generate predictive models from the whole 3B-nucleotide genome. As a result, tests remain limited to one to a few hundred genes. Prediction continues to rely mostly on such factors as a tumor’s grade and the patient’s age. And the understanding and management of cancer continue to involve guesswork.
A genome-wide pattern in glioblastoma brain cancer tumors was experimentally validated in a retrospective clinical trial as the most accurate and precise predictor of life expectancy and response to standard of care [1]. Applicable to the general population, this predictor, the first to encompass the whole genome, and predictors in lung, nerve, ovarian, and uterine cancers, were mathematically (re)discovered and computationally (re)validated in open-source datasets from as few as 50–100 patients by using our AI/ML [2,3]. Data-agnostic, our algorithms, multi-tensor comparative spectral decompositions, extend the mathematics that underlies quantum mechanics to overcome typical AI/ML obstacles by not requiring large amounts of data, balanced data, or feature engineering. All other attempts to connect a glioblastoma patient’s outcome with the tumor’s DNA copy numbers failed. For 70 years, the best indicator has been age. At 75–95% accuracy, our predictor is more accurate than and independent of age and all other indicators. Platform- and reference genome-agnostic, the predictor’s >99% precision is greater than the community consensus of <70% reproducibility based upon one to a few hundred genes. It describes mechanisms for transformation, and identifies drug targets and combinations of targets to sensitize tumors to treatment.
Now, in follow-up results from the trial we, first, show correct prospective prediction of the outcome of the five of the 79 patients who were alive four years earlier, at the time of first results. Two patients, who were predicted to have shorter survival, lived less than five years from diagnosis, whereas of the three patients predicted to have longer survival, one lived more than five, and the remaining two are alive >11.5, years from diagnosis. Second, we demonstrate 100%-precise clinical prediction for the 59 of the 79 patients with remaining tumor DNA by using whole-genome sequencing in a regulated laboratory. Third, we establish that the risk that a tumor’s whole genome confers upon outcome, as is reflected by the predictor, is surpassed only by the patient’s access to radiotherapy.
This is a proof of principle that our AI/ML is uniquely suited for personalized medicine. This also demonstrates that the inclusion of complete genomes, and the normal diversity within, is, beyond fair AI/ML, a scientific, engineering, and medical necessity, because a patient’s survival and response to treatment are the outcome of their tumor’s whole genome. We conclude that our AI/ML-derived whole-genome predictors can take the guesswork out of cancer.
[1] Ponnapalli et al., APL Bioeng 4, 026106 (2020); https://doi.org/10.1063/1.5142559
[2] Bradley et al., APL Bioeng 3, 036104 (2019); https://doi.org/10.1063/1.5099268
[3] Alter et al., PNAS 100, 3351 (2003); https://doi.org/10.1073/pnas.0530258100
Workshop
Architecture and Networks
W
DescriptionIn this work, we introduce Altis-SYCL, a benchmark suite based on SYCL for GPUs and FPGAs. For developing Altis-SYCL, we leverage the oneAPI heterogeneous programming framework in two consecutive steps: 1) by using the modern Altis GPGPU benchmark suite as a baseline and migrating it from CUDA to SYCL, and 2) by exploring several techniques to optimize the performance of the resulting SYCL code. Our migration-and-optimization methodology starts by targeting GPUs and progressively moves towards FPGAs. In this process, we discuss the differences between device-specific strategies as well as detailing the required code refactoring and optimization efforts. The performance of Altis-SYCL was evaluated on Stratix 10 and Agilex FPGAs, and for some applications, their execution runtimes were competitive with those achieved on the latest high-end GPUs. The corresponding code is released as open source at https://github.com/esa-tu-darmstadt/altis_sycl.
Birds of a Feather
HPC in Society
TP
XO/EX
DescriptionThe SC23 edition of the Birds of a Feather “Americas High-Performance Computing Collaboration: Global Actions” seeks to showcase collaborations that have resulted from the partnerships formed since the first edition at SC19, presenting opportunities and experiences between different HPC networks and laboratories from countries in North, Central, and South America and other continents, mainly Europe. In the BoF, different aspects will be discussed around the expectations and experiences of collaboration in HPC, to feed the continental roadmap. This BoF is a crucial step to support the signature of an MoU to start the formalization of the Americas HPC Collaboration.
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionAs supercomputers advance toward exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to address these two challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how AMR and error-bounded lossy compression can function together. To this end, this study presents a novel in-situ lossy compression framework that employs the HDF5 filter to improve both I/O costs and boost compression quality for AMR applications. We implement our solution into the AMReX framework and evaluate on two real-world AMR applications, Nyx and WarpX, on the Summit supercomputer. Experiments with 512 cores demonstrate that AMRIC improves the compression ratio by 81x and the I/O performance by 39x over AMReX's original compression solution.
Workshop
State of the Practice
W
DescriptionAs high-performance computing approaches the exascale era, the analysis of the vast amount of monitoring data generated by supercomputers has become increasingly challenging for data analysts. The detection of change points, which plays a critical role in anomaly detection, performance optimization, and root cause analysis of problems and failures, has grown beyond human capacity for manual review. To address this issue, our focus lies in developing an effective model capable of identifying anomalous behavior, and to achieve this, we introduce the concept of an online adaptive sampling algorithm. Evaluating the model's performance across various use cases, we conduct tests on complex datasets to detect change points. Overall, we observe that the model successfully captures key features of normal behavior, and we believe it opens promising avenues for further research, particularly in assisting with various tasks related to anomaly detection and performance optimization in high-performance computing environments.
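The model developed in this work is its own; as a sketch of the kind of change-point logic it automates, the snippet below runs a classic one-sided CUSUM detector over a synthetic monitoring signal, with the baseline statistics, drift, and threshold chosen purely for illustration:

```python
import numpy as np

def first_change(stream, mean, std, drift=0.5, threshold=8.0):
    """Classic two-sided CUSUM: return the index of the first detected shift
    (up or down) in a stream standardized against its normal behavior."""
    g_pos = g_neg = 0.0
    for i, x in enumerate(stream):
        z = (x - mean) / std
        g_pos = max(0.0, g_pos + z - drift)
        g_neg = max(0.0, g_neg - z - drift)
        if g_pos > threshold or g_neg > threshold:
            return i
    return None

# Synthetic node metric: stable around 50, then shifting to 58 at sample 300.
rng = np.random.default_rng(3)
signal = np.concatenate([rng.normal(50, 2, 300), rng.normal(58, 2, 200)])
print("first change point flagged at sample", first_change(signal, mean=50.0, std=2.0))
```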
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms and Frameworks
W
DescriptionGraph Neural Networks (GNNs) are becoming increasingly popular for applying neural networks to graph data. However, as the size of the input graph increases, the GPU memory wall problem becomes an important issue. Since both current solutions to reduce the memory footprint, such as mini-batch approaches and the use of memory-efficient tensor manipulations, have drawbacks, we attempt to solve the problem by expanding the memory size using a virtual memory technology. To overcome the data transfer overhead of virtual memory technology, in this paper we focus on analyzing the memory access pattern of GNNs with the goal of reducing the data transfer latency perceived by the user. A preliminary result of applying optimization techniques guided by our analysis results shows a 40% reduction in the execution time of a combination of training and testing.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionIn conventional multi-GPU configurations, the host manages execution, kernel launches, communication, and synchronization, incurring unnecessary overhead. To mitigate this, we present a CPU-free model that delegates control to the devices themselves, especially benefiting communication-intensive applications. Utilizing techniques such as persistent kernels, specialized thread blocks, and device-initiated communication, we create autonomous multi-GPU code that drastically reduces communication overhead. Our approach is demonstrated with popular solvers, including 2D/3D Jacobi stencils and Conjugate Gradient (CG). We are currently developing compiler technology for the model, applying it to a broader set of applications, and building its debugging/profiling tools.
Posters
Research Posters
TP
XO/EX
DescriptionResource disaggregation is prevalent in datacenters since it provides high resource utilization when compared to servers dedicated to either compute, memory, or storage. NVMe-over-Fabrics (NVMe-oF) is the standardized protocol used for accessing disaggregated storage over the network. Currently, the NVMe-oF specification lacks any semantics to prioritize I/O requests based on different application needs. Since applications have varying goals — latency-sensitive or throughput-critical I/O — we need to design efficient schemes in order to allow applications to specify the type of performance they wish to achieve. Furthermore, with additional tenants, we need to provide the respective specified performance optimizations that each application requests, regardless of congestion. This is a challenging problem, as the current NVMe specification lacks semantics to support multi-tenancy. Our research poster brings awareness to the ways in which we can bring multi-tenancy support to the NVMe-oF specification.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms and Frameworks
W
DescriptionTraditional graph-processing algorithms have been widely used in Graph Neural Networks (GNNs). Current approaches to graph processing in deep learning face two main problems. Firstly, easy-to-use deep learning libraries lack support for widely used graph processing algorithms and do not provide low-level APIs for building distributed graph processing algorithms. Secondly, existing graph processing libraries are not user-friendly for deep learning researchers. This paper presents an efficient and easy-to-use graph engine that incorporates distributed graph processing into deep-learning ecosystems. We develop a distributed graph storage system with an efficient batching technique to minimize communication overhead incurred by Remote Procedure Calls between computing nodes. We propose an optimized method for distributed computation of Single Source Personalized PageRank (SSPPR) using the Forward Push algorithm based on lock-free parallel maps. Experimental evaluations demonstrate significant improvement, with up to three orders of magnitude in SSPPR throughput, of our graph engine compared with the tensor-based implementation.
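The contribution here is the distributed, lock-free implementation; the single-node sketch below shows the textbook Forward Push routine for single-source personalized PageRank that such an engine parallelizes (the toy graph, teleport probability, and tolerance are illustrative):

```python
from collections import defaultdict, deque

def forward_push(adj, source, alpha=0.15, eps=1e-6):
    """Textbook Forward Push for single-source personalized PageRank.
    `adj` maps each node to a list of its out-neighbors."""
    reserve = defaultdict(float)                       # lower-bound PPR estimates
    residue = defaultdict(float, {source: 1.0})        # probability mass still to push
    active, queued = deque([source]), {source}
    while active:
        u = active.popleft()
        queued.discard(u)
        r, deg = residue[u], len(adj.get(u, ()))
        if deg == 0 or r < eps * deg:                  # not enough mass to push yet
            continue
        reserve[u] += alpha * r                        # settle an alpha fraction here
        residue[u] = 0.0
        share = (1.0 - alpha) * r / deg                # spread the rest to out-neighbors
        for v in adj[u]:
            residue[v] += share
            if v not in queued and residue[v] >= eps * max(len(adj.get(v, ())), 1):
                active.append(v)
                queued.add(v)
    return dict(reserve)

# Tiny directed graph for illustration.
g = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
print(forward_push(g, source=0))
```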
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn this work we perform one of the first in-depth, empirical comparisons of the Arm and RISC-V instruction sets. We compare a series of benchmarks compiled with GCC 9.2 and 12.2, targeting the scalar subsets of Arm's Armv8-A and RISC-V's RV64G. We analyze instruction counts, critical paths, and windowed critical paths to get an estimate of performance differences between the two instruction sets, determining where each has advantages and disadvantages. The results show the instruction sets are relatively closely matched on the metrics we evaluated for the benchmarks we considered, indicating that neither ISA has a large, inherent architectural advantage over the other.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionHigh-Performance Computing (HPC) centers demand a lot of power, and that demand continues to grow through the exascale era. This work establishes the need for a multi-tiered, feedback-driven power management framework that follows dynamic power objectives while maximizing job performance, highlighting the need to respond to external factors (e.g., power constraints) and internal factors (e.g., performance variation). We present a practical implementation of this framework on a real-world cluster in addition to conducting simulations for larger data centers. We accurately track a moving power target for demand response while reacting to incomplete or inaccurate prior knowledge about job power and performance properties. We demonstrate that online performance feedback from a job runtime enables a cluster power management policy to recover most of the performance degradation introduced by job-type misclassification.
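The paper's multi-tiered framework is considerably richer; as a toy sketch of the feedback idea, the snippet below proportionally rescales per-job power caps toward a moving cluster-level target, with the gain, floor, and simulated job draws all assumed for illustration:

```python
def adjust_caps(caps, measured_power, target_power, floor=100.0, gain=0.5):
    """Proportionally scale per-job power caps (watts) toward a moving
    cluster-level target, never dropping a job below a safety floor."""
    error = target_power - measured_power
    total = sum(caps.values())
    scale = 1.0 + gain * error / total
    return {job: max(floor, cap * scale) for job, cap in caps.items()}

# Toy demand-response event: the facility target drops from 1000 W to 800 W.
caps = {"job_a": 400.0, "job_b": 350.0, "job_c": 300.0}
measured = 1000.0
for target in (1000.0, 800.0, 800.0, 800.0):
    caps = adjust_caps(caps, measured, target)
    measured = sum(caps.values()) * 0.95      # pretend jobs draw 95% of their caps
    print(target, {j: round(c) for j, c in caps.items()}, round(measured))
```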
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionThe MPI 4.0 standard introduced the concept of partitioned point-to-point communication. One facet that may help in encouraging application developers to use this new concept in their programs is the availability of proper tool support in a timely manner. We therefore propose nine new events extending the OTF2 event model to accurately represent the runtime behavior of partitioned point-to-point communication. We then demonstrate the suitability of these extensions with three different use cases in the context of performance analysis. In particular, we showcase a prototype implementation of an extended waitstate analysis in the Scalasca trace analyzer, and discuss further potential use cases in the realm of trace visualization and simulation.
Workshop
Quantum Computing
Software Engineering
W
DescriptionA crucial step in compiling a quantum algorithm involves addressing a layout problem to meet the device's layout constraints. The Qubit Mapping and Routing (QMR) problem aims to minimize the number of SWAP gates added to the circuit to fulfill NISQ hardware's connectivity constraints. Although this problem is NP-hard, finding solutions quickly is vital as it is part of the compilation process.
In this research, we present the QMR problem as a Quadratic Unconstrained Binary Optimization problem (QUBO) and utilize specialized hardware, the Fujitsu Digital Annealer, for faster solving. Experiments on various benchmarks are conducted, comparing our approach to popular methods like Qiskit and tket. Remarkably, our method achieves the optimal solutions for almost all instances in the QUEKO benchmark, outperforming other solvers significantly. Furthermore, we demonstrate our approach's superior performance in various instances when compared to other application-specific quantum circuits.
Workshop
Education
State of the Practice
W
DescriptionThis work presents an overview of an NSF Research Experience for Undergraduate Site on Trust and Reproducibility of Intelligent Computation, delivered by faculty and graduate students in the Kahlert School of Computing at University of Utah. The chosen themes bring together several concerns for the future in producing computational results that can be trusted: secure, reproducible, based on sound algorithmic foundations, and developed in the context of ethical considerations. The research areas represented by student projects include machine learning, high-performance computing, algorithms and applications, computer security, data science, and human-centered computing. In the first four weeks of the program, the entire student cohort spent their mornings in lessons from experts in these crosscutting topics, and used one-of-a-kind research platforms operated by the University of Utah, namely NSF-funded CloudLab and POWDER facilities. This program can serve as a model for preparing a future workforce to integrate ML into trustworthy reproducible applications.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionIn the high performance computing (HPC) domain, performance variability is a major scalability issue for parallel computing applications with heavy synchronization and communication. We present an experimental performance analysis of OpenMP benchmarks regarding the variation of execution time, and determine the potential factors causing performance variability.
Our work offers some understanding of performance distributions and directions for future work on how to mitigate variability for OpenMP-based applications. Two representative OpenMP benchmarks from the EPCC OpenMP micro-benchmark suite and BabelStream are run across two x86 multicore platforms featuring up to 256 threads. From the obtained results, we characterize and explain the execution time variability as a function of thread-pinning, simultaneous multithreading (SMT) and core frequency variation.
Workshop
Accelerators
Applications
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWith the advent of GPUs in parallel computing, several languages, tools, and compilers are being developed. Many impactful applications can benefit from the performance capabilities these GPUs provide, but moving large, complex code bases to GPU execution often poses many hurdles and growing pains as developers adapt to unfamiliar programming models and interface with increasingly complex, but powerful, hardware. Our work discusses experiences using OpenACC to bring GPU acceleration to MURaM, a state-of-the-art solar physics application, including various problems we have explored and overcome to bring better performance portability to the code within the limitations of the programming model. We then provide scaling results and findings from transitioning to current-generation GPU architectures, with strong and weak scaling on up to 512 NVIDIA A100 GPUs, observing that one A100 GPU is comparable to 90-100 CPU cores and that GPU runs scale much further than the CPU runs are capable of.
Workshop
Data Analysis, Visualization, and Storage
Data Compression
W
DescriptionToday’s scientific simulations generate exceptionally large volumes of data, challenging the capacities of available I/O bandwidth and storage space. This necessitates a substantial reduction in data volume, for which error-bounded lossy compression has emerged as a highly effective strategy. A crucial metric for assessing the efficacy of lossy compression is visualization. Despite extensive research on the impact of compression on visualization, there is a notable gap in the literature concerning the effects of compression on the visualization of Adaptive Mesh Refinement (AMR) data. AMR has proven to be a potent solution for the rising computational intensity and the explosive growth in data volume. However, the hierarchical and multi-resolution characteristics of AMR data introduce unique challenges to its visualization, and these challenges are further compounded when data compression comes into play. This article studies the intricacies of how data compression influences, and introduces novel challenges to, the visualization of AMR data.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionParallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the above-mentioned problem, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Workshop
Distributed Computing
Security
W
DescriptionBoth 3rd generation Xeon scalable processors and Gramine 1.0, which potentially improves the performance of Intel SGX, were released in 2021. In this paper, we provide the first performance analysis of HPC workloads with Gramine and SGX on 3rd generation Xeon scalable processors. Our analysis starts with some microbenchmarks and is then extended to various HPC workloads. Our experimental results show that Gramine+SGX incurs a small performance overhead (4-17%) for both compute-intensive and memory-bandwidth-sensitive workloads but a larger performance overhead (up to 170%) for a memory-latency-sensitive workload. In addition, we show that the combination of Gramine and a 3rd generation Xeon scalable processor shows a slowdown of 1.5x on average (up to 4.4x) for many HPC workloads. This number is an order of magnitude smaller than that reported in previous work using the combination of the former generation SGX toolchain and processor.
Paper
ANT-MOC: Scalable Neutral Particle Transport Using 3D Method of Characteristics on Multi-GPU Systems
Accelerators
Applications
Modeling and Simulation
TP
Best Paper Finalist
Best Student Paper Finalist
DescriptionThe Method of Characteristics (MOC) for solving the Neutron Transport Equation (NTE) is the core of full-core simulation for reactors. High resolution is enabled by discretizing the NTE through massive numbers of tracks that traverse the 3D reactor geometry. However, 3D full-core simulation is prohibitively expensive because of the high memory consumption and the severe load imbalance. To deal with these challenges, we develop ANT-MOC. Specifically, we build a performance model for memory footprint, computation, and communication, based on which a track management strategy is proposed to overcome the resolution bottlenecks caused by limited GPU memory. Furthermore, we implement a novel multi-level load mapping strategy to ensure load balancing among nodes, GPUs, and CUs. ANT-MOC enables a 3D full-core reactor simulation with 100 billion tracks on 16,000 GPUs, with 70.69% and 89.38% parallel efficiency for strong scalability and weak scalability, respectively.
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionPerformance tuning, software/hardware co-design, and job scheduling are among the many tasks that rely on models to predict application performance. We propose and evaluate low-rank tensor decomposition for modeling application performance. We discretize the input and configuration domains of an application using regular grids. Application execution times mapped within grid-cells are averaged and represented by tensor elements. We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors. We further show that this decomposition enables accurate extrapolation of unobserved regions of an application's parameter space. We then employ tensor completion to optimize a CP decomposition given a sparse set of observed execution times. We consider alternative piecewise/grid-based models and supervised learning models for six applications and demonstrate that CP decomposition optimized using tensor completion offers higher prediction accuracy and memory-efficiency for high-dimensional performance modeling.
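The paper optimizes the CP model with tensor completion over sparsely observed execution times; the dense numpy sketch below only shows the basic CP decomposition via alternating least squares that such a model builds on, with the tensor sizes, rank, and synthetic "execution time" data chosen for illustration:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: row (j*K + k) equals B[j, :] * C[k, :]."""
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def cp_als(T, rank, iters=100):
    """Plain alternating least squares for a rank-`rank` CP model of a 3-way tensor."""
    I, J, K = T.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((d, rank)) for d in (I, J, K))
    T0 = T.reshape(I, -1)                         # mode-0 unfolding
    T1 = np.moveaxis(T, 1, 0).reshape(J, -1)      # mode-1 unfolding
    T2 = np.moveaxis(T, 2, 0).reshape(K, -1)      # mode-2 unfolding
    for _ in range(iters):
        A = T0 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T1 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T2 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

rng = np.random.default_rng(1)
# Synthetic low-rank "execution time" tensor, e.g., nodes x ranks-per-node x input size.
true = [rng.random((d, 3)) + 0.1 for d in (8, 10, 12)]
T = np.einsum("ir,jr,kr->ijk", *true)
A, B, C = cp_als(T, rank=3)
approx = np.einsum("ir,jr,kr->ijk", A, B, C)
print("relative error:", np.linalg.norm(T - approx) / np.linalg.norm(T))
```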
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThis BoF provides a forum for Fortran developers to engage with its modern programming features. Fortran continues to play a crucial role in numerous legacy applications, but with features introduced in recent standards, the language also supports modern programming practices and high-performance computing. As Fortran 2023 approaches, this BoF brings together developers from various domains to share experiences and explore the language's evolving capabilities. After some brief panelist presentations, the session will focus on an interactive discussion where audience members will be encouraged to share their own experiences and ask questions of our panelists.
Posters
Research Posters
TP
XO/EX
DescriptionType Ia supernovae are highly luminous thermonuclear explosions of white dwarfs that serve as standardizable distance markers for investigating the accelerating expansion of our Universe. Most existing supernova simulation codes are designed to run only on homogeneous CPU-only systems and do not take advantage of the increasing shift toward heterogeneous architectures in HPC. To address this, we present Ares, the first performance-portable, massively parallel code for simulating thermonuclear burn fronts. By creating multi-physics modules using the Kokkos and Parthenon frameworks, we are able to scale supernova simulations to distributed HPC clusters running on any of the CUDA, HIP, SYCL, HPX, OpenMP, and serial backends. We evaluate our application by conducting weak and strong scaling studies on both CPU and GPU clusters, showing the efficiency of our method for a diverse set of targets.
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThis BoF brings together the Arm HPC community to discuss experiences and lessons learned in delivering and operating Arm-based HPC systems. The maturity of the Arm HPC ecosystem has been discussed extensively, especially for the upper part of the stack (compilers, libraries, applications). This BoF instead turns to the other side of the coin: the administration and management of such systems. Primed by a short opening session from well-recognized experts in the community, the host and panel will engage attendees to share and ask probing questions. Audience participation is strongly encouraged.
Workshop
Fault Handling and Tolerance
W
DescriptionHigh-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for reproducibility studies faces the challenge of high-frequency checkpointing across thousands of processes: the resulting runtime overhead affects application performance, and intermediate results can change when interleaving is introduced during floating-point calculations. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated by scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed with identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed with the NWChem software. Experimental results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30x and up to 211x compared to the default NWChem checkpointing approach.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionAs modern High-Performance Computing (HPC) systems reach exascale performance, their power consumption becomes a serious threat to environmental and energy sustainability. Efficient power management in HPC systems is crucial for optimizing workload management, reducing operational costs, and promoting environmental sustainability. Accurate prediction of job power consumption plays an important role in achieving these goals. We apply a technique combining Machine Learning (ML) algorithms with Natural Language Processing (NLP) tools to predict job power consumption. The solution predicts a job's maximum and average power consumption per node, leveraging only information that is available at the time of job submission. The prediction is performed in an online fashion, and we validate the approach using batch system logs extracted from the supercomputer Fugaku. The experimental evaluation shows promising results, outperforming classical techniques while obtaining an R2 score of more than 0.53 for both of our prediction tasks.
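As a rough illustration of the kind of pipeline the abstract describes (ML combined with NLP on submission-time information), the sketch below trains a regressor on TF-IDF features of job-script text using scikit-learn; the data are synthetic and this is not the authors' model or the Fugaku logs.

```python
# Hypothetical sketch of predicting per-node job power from submission-time
# text (job name / script), in the spirit of the ML+NLP approach; uses
# scikit-learn and synthetic data, not the authors' pipeline or Fugaku logs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

job_scripts = [
    "lammps md run 128 nodes gpu",
    "python train resnet50 imagenet",
    "wrf weather forecast ensemble",
    "bwa mem genome alignment",
]
avg_power_per_node = [310.0, 420.0, 280.0, 150.0]  # watts, synthetic labels

model = make_pipeline(TfidfVectorizer(),
                      RandomForestRegressor(n_estimators=100, random_state=0))
model.fit(job_scripts, avg_power_per_node)
print(model.predict(["python train transformer language model"]))
```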
Workshop
Education
Heterogeneous Computing
Reproducibility
State of the Practice
W
DescriptionTechnological advances have increased the importance of teaching the fundamentals of robotics and autonomous systems, a subject that relies on strong hands-on practical experimentation. National Science Foundation (NSF)-supported testbeds have opened the doors to experimentation and support for the next era of computing platforms and large-scale cloud research.
We present an open-source educational module that makes this education accessible, aiming to prepare learners for technological career paths. The module is designed to bring hands-on sessions to students and help them attain knowledge in a comprehensive manner. Specifically, we present AutoLearn: Learning in the Edge to Cloud Continuum, an educational module that integrates a collection of educational artifacts built around an open-source, small-scale self-driving platform and leverages the Chameleon Cloud testbed to teach cloud computing concepts, edge device technology, and artificial-intelligence-driven applications.
Paper
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
TP
DescriptionIn a parallel and distributed application, a mapping is a selection of a processor for each computation or task and memories for the data collections that each task accesses. Finding high-performance mappings is challenging, particularly on heterogeneous hardware with multiple choices for processors and memories. We show that fast mappings are sensitive to the machine, application, and input. Porting to a new machine, modifying the application, or using a different input size may necessitate re-tuning the mapping to maintain the best possible performance.
We present AutoMap, a system that automatically tunes the mapping to the hardware used and finds fast mappings without user intervention or code modification. In contrast, hand-written mappings often require days of experimentation. AutoMap utilizes a novel constrained coordinate-wise descent search algorithm that balances the trade-off between running computations quickly and minimizing data movement. AutoMap discovers mappings up to 2.41x faster than custom, hand-written mappers.
Workshop
Applications
State of the Practice
W
DescriptionBackground: Cancer is the second leading cause of death in the United States [1]. Automatic characterization of malignant disease is an important clinical need to facilitate early detection and treatment of cancer [2]. Advances in machine learning (ML) and deep learning (DL) have shown significant promise for radiological and oncological applications [3]. Radiomic analysis extracts quantitative features from radiologic data about a cancerous tumor [4]. DL methods require large training datasets with sufficiently annotated images, which are difficult to obtain for radiological applications. The objective of this study was to develop a deep semi-supervised transfer learning approach for automated whole-body tumor segmentation and prognosis on positron emission tomography (PET)/computed tomography (CT) scans using limited annotations (Fig. 1a).
Methods: Five datasets consisting of 1,019 prostate, lung, melanoma, lymphoma, head and neck, and breast cancer patients with prostate-specific membrane antigen (PSMA) and fluorodeoxyglucose (FDG) PET/CT scans were used in this study (Table 1). A nnUnet backbone was cross-validated on the tumor segmentation task via a 5-fold cross-validation. Predicted segmentations were iteratively improved using radiomic analysis. Transfer learning generalized the segmentation task across PSMA and FDG PET/CT. Segmentation accuracy was evaluated on true positive rate (TPR), positive predictive value (PPV), Dice similarity coefficient (DSC), false discovery rate (FDR), true negative rate (TNR), and negative predictive value (NPV). Imaging measures quantifying molecular tumor burden and uptake were extracted from the predicted segmentations. A risk stratification model was developed for prostate cancer by combining the extracted imaging measures and was evaluated on follow-up prostate-specific antigen (PSA) levels. A risk stratification model was developed for head and neck cancer patients by combining imaging measures and American Joint Committee on Cancer (AJCC) staging and was evaluated via Kaplan-Meier survival analysis. A prognostic model was developed to predict pathological response of breast cancer patients to neoadjuvant chemotherapy using imaging measures from pre-therapy and post-therapy PET/CT scans. Prognostic models were evaluated on overall accuracy and area under the receiver operating characteristic (AUROC) curve. Statistically significant differences were inferred using a Wilcoxon rank-sum test.
Results: Accuracy metrics and illustrative examples of predicted tumor segmentations are shown in Table 2 and Fig. 1b. The risk stratification model yielded an overall accuracy of 0.83 and an AUROC of 0.86 in stratifying prostate cancer patients (Fig. 1c). Median follow-up PSA levels in the low-intermediate and high risk groups were 1.19 ng/mL and 53.20 ng/mL (P < 0.05). Head and neck cancer patients were stratified into low, intermediate, and high risk groups with significantly different Kaplan-Meier survival curves by the log-rank test (Fig. 1d). A prognostic model using imaging measures from pre-therapy scans predicted pathological complete response (pCR) in breast cancer patients with an accuracy of 0.72 and an AUROC of 0.72. The model using imaging measures from both pre-therapy and post-therapy scans predicted pCR in breast cancer patients with an accuracy of 0.84 and an AUROC of 0.76.
Conclusion: A deep semi-supervised transfer learning approach was developed and demonstrated accurate tumor segmentation, quantification, and prognosis on PET/CT of patients across six cancer types.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionWe introduce a novel energy-efficient job scheduling approach for High-Performance Computing (HPC) environments. Its primary objective is to bridge the gap between research and production in energy-efficient scheduling models for HPC. The proposed architecture and program decouple the scheduling heuristics from the HPC scheduler SLURM into a Python application, enabling adaptability for production setups. The implementation demonstrates an 11% potential energy saving on the High-Performance Conjugate Gradients (HPCG) benchmark, highlighting the practicality of the approach on a single-node HPC cluster. This work serves as a foundation for integrating research in this area into production, offering a realistic example of energy-efficient HPC in practice. It also opens possibilities for more advanced applications, such as automatically scheduling jobs during low-cost and renewable-energy periods, as already done by companies employing HPC. This contribution showcases a practical, energy-efficient solution for HPC job scheduling and identifies potential for future enhancements in this area.
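For the follow-on use case mentioned above (shifting jobs into low-cost or renewable-energy periods), the snippet below is a hypothetical illustration of picking the cheapest start window from a forecast; it is not the authors' SLURM-integrated scheduler.

```python
# Hypothetical illustration of delaying a job to the cheapest window in a
# carbon-intensity/price forecast; not the authors' SLURM-integrated scheduler.
def best_start_slot(forecast, job_slots):
    """Return the start index minimizing total cost over `job_slots` slots."""
    costs = [sum(forecast[i:i + job_slots])
             for i in range(len(forecast) - job_slots + 1)]
    return min(range(len(costs)), key=costs.__getitem__)

# Hourly grid carbon intensity (gCO2/kWh) for the next 12 hours (made up).
forecast = [430, 410, 390, 350, 300, 260, 240, 250, 320, 380, 420, 450]
start = best_start_slot(forecast, job_slots=3)
print(f"schedule the 3-hour job at hour {start}")  # -> hour 5
```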
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionWhile considerable research has been directed at automatic parallelization for shared-memory platforms, little progress has been made in automatic parallelization schemes for distributed-memory systems. We introduce an innovative approach to automatically produce distributed-memory parallel code for an important sub-class of affine tensor computations common to Coupled Cluster (CC) electronic structure methods, neuro-imaging applications, and deep learning models.
We propose a novel systematic approach to modeling the relations and trade-offs of mapping computations and data onto multi-dimensional grids of homogeneous nodes. Our formulation explores the space of computation and data distributions across processor grids. Tensor programs are modeled as a non-linear symbolic formulation accounting for the volume of data communication and per-node capacity constraints induced under specific mappings. Solutions are found, iteratively, using the Z3 SMT solver, and used to automatically generate efficient MPI code. Our evaluation demonstrates the effectiveness of our approach over Distributed-Memory Pluto and the Cyclops Tensor Framework.
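As a flavor of how such a mapping problem can be posed to an SMT solver, the following is a tiny, hypothetical z3py sketch that selects a 2D processor grid under node-count and per-node memory constraints while minimizing a simple communication proxy; it illustrates the style of formulation only and is not the paper's actual model.

```python
# Tiny, hypothetical z3py sketch of choosing a 2D processor grid p1 x p2 for a
# block-distributed N x N matrix, minimizing a communication-volume proxy under
# node-count and per-node memory constraints; not the paper's actual model.
from z3 import Ints, Optimize, sat

N, P, MEM_WORDS = 16384, 64, 16 * 1024 * 1024   # matrix dim, nodes, per-node capacity
p1, p2 = Ints("p1 p2")
opt = Optimize()
opt.add(p1 >= 1, p2 >= 1, p1 * p2 == P)
opt.add(N * N <= MEM_WORDS * p1 * p2)           # local tiles must fit in memory
# Crude proxy for boundary-exchange volume of the block distribution.
opt.minimize(N * p1 + N * p2)
if opt.check() == sat:
    m = opt.model()
    print("grid:", m[p1], "x", m[p2])
```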
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn this paper, we propose and evaluate several optimized implementations of the general matrix multiplication (Gemm) on two different RISC-V architecture cores implementing the RISC-V vector extension (RVV): C906 and C910 from T-HEAD. Specifically, we address the performance portability problem across these processor cores by means of an automatic assembly code generator, written in Python, capable of emitting RVV code for high performance computing (HPC), with a variety of combinations of specific and general optimizations.
Our experimental results using a number of automatically generated micro-kernels for Gemm on both RISC-V architectures reveal a different impact for each optimization depending on the target architecture, and highlight the importance of automatically generating HPC RVV code to achieve performance portability while reducing developer effort. In addition, these optimizations yield important performance gains with respect to a state-of-the-art tuned BLAS library (OpenBLAS), reaching 3x and 1.3x speed-ups for the C910 and C906, respectively.
Workshop
Performance Optimization
W
DescriptionThe rapid development of machine learning (ML) has prompted demand for low-precision arithmetic hardware that can deliver faster computing speed. Weather simulation applications typically exhibit high sensitivity to small perturbations of the input data, but this inherent uncertainty paves the way for mixed-precision computing (MPC) by trading accuracy for performance. Additional challenges in balancing lower computational cost against accuracy requirements need to be addressed before MPC can be applied successfully to weather modeling applications. Determining an acceptable precision allocation for variables involves navigating an exponential search space of mixed-precision configurations. We propose a mixed-precision code tuning framework that automatically searches for suitable precision configurations for weather modeling applications using black-box optimization algorithms. The resulting configurations achieve up to a 30% performance gain while staying within the tolerance level, offering a workflow that facilitates the identification of variables sensitive to precision change.
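A schematic of the search problem, with a synthetic cost/error model standing in for real model runs, might look like the following; it is only meant to illustrate precision-configuration search under a tolerance, not the framework proposed in the paper.

```python
# Schematic, hypothetical search over per-variable precision configurations
# with an accuracy tolerance; the cost and error models below are stand-ins,
# not the tuning framework described in the paper.
import itertools

variables = ["pressure", "humidity", "wind_u", "wind_v"]
precisions = ["fp64", "fp32"]

def run_with(config):
    """Pretend to run the model: lower precision is cheaper but less accurate."""
    n_fp32 = sum(1 for p in config.values() if p == "fp32")
    runtime = 100.0 - 8.0 * n_fp32          # seconds (synthetic)
    error = 1e-12 * 10 ** n_fp32            # vs. an fp64 baseline (synthetic)
    return runtime, error

TOLERANCE = 1e-9
best = None
for combo in itertools.product(precisions, repeat=len(variables)):
    config = dict(zip(variables, combo))
    runtime, error = run_with(config)
    if error <= TOLERANCE and (best is None or runtime < best[0]):
        best = (runtime, config)
print(best)
```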
Posters
Research Posters
TP
XO/EX
DescriptionThe increasing demand for processing power on resource-constrained edge devices necessitates efficient techniques for optimizing High Performance Computing (HPC) applications. We propose HPEE (HPC Parameter Exploration on Edge), a novel approach that formulates the parameter search problem as a pure-exploration multi-armed bandit (MAB) problem. By efficiently exploring the search space with the MAB framework, we achieve significant performance improvements while respecting the limited computational resources of edge devices. Experimental results based on an HPC application demonstrate the effectiveness of our approach in optimizing parameter search on edge devices, offering a promising solution for enhancing HPC performance in resource-constrained environments.
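The sketch below shows one common pure-exploration strategy (successive halving) over a small configuration space with noisy synthetic measurements; it is illustrative only and not the HPEE implementation.

```python
# Toy, hypothetical successive-halving sketch of pure-exploration bandit
# search over runtime configurations on an edge device; not the HPEE code.
import random

arms = [{"threads": t, "tile": b} for t in (1, 2, 4) for b in (16, 32, 64)]

def measure(arm):
    """One noisy runtime sample (seconds) for a configuration (synthetic)."""
    base = 10.0 / arm["threads"] + 0.02 * arm["tile"]
    return base + random.gauss(0.0, 0.1)

survivors, budget = list(arms), 2
while len(survivors) > 1:
    scored = [(sum(measure(a) for _ in range(budget)) / budget, a)
              for a in survivors]
    scored.sort(key=lambda s: s[0])
    survivors = [a for _, a in scored[: max(1, len(survivors) // 2)]]  # keep faster half
    budget *= 2                                                        # spend more on survivors
print("selected configuration:", survivors[0])
```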
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionApache TVM (Tensor Virtual Machine), an open source machine learning compiler framework designed to optimize computations across various hardware platforms, provides an opportunity to improve the performance of dense matrix factorizations such as LU (Lower Upper) decomposition and Cholesky decomposition on GPUs, FPGAs, ASICs, and AI accelerators. In this paper, we propose a new TVM autotuning framework using Bayesian Optimization and use the TVM tensor expression language to implement linear algebra kernels such as LU, Cholesky, and 3mm. We use these scientific computation kernels to evaluate the effectiveness of our methods on a GPU cluster, called Swing, at Argonne National Laboratory. We compare the proposed autotuning framework with the TVM autotuning framework AutoTVM with four tuners and find that our framework outperforms AutoTVM in most cases.
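For readers unfamiliar with the approach, the following is a compact, hypothetical Bayesian-optimization autotuning loop using scikit-optimize over two kernel knobs with a synthetic cost function; it is not the authors' TVM integration.

```python
# Compact, hypothetical Bayesian-optimization autotuning loop using
# scikit-optimize over two kernel knobs; the cost function is synthetic and
# this is not the authors' TVM-integrated framework.
from skopt import gp_minimize
from skopt.space import Integer

def measured_runtime(params):
    """Pretend to build and time a kernel for (tile_size, unroll_factor)."""
    tile, unroll = params
    return (tile - 48) ** 2 / 100.0 + (unroll - 4) ** 2 + 1.0  # synthetic ms

space = [Integer(8, 128, name="tile"), Integer(1, 8, name="unroll")]
result = gp_minimize(measured_runtime, space, n_calls=25, random_state=0)
print("best config:", result.x, "predicted runtime:", result.fun)
```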
Posters
Research Posters
TP
XO/EX
DescriptionDistributed large-model inference still faces a dilemma in balancing latency against throughput, or rather cost against effect. Tensor parallelism, while capable of optimizing latency, entails substantial expense. Conversely, pipeline parallelism excels in throughput but falls short in minimizing execution time.
To address this challenge, we introduce a novel solution: interleaved parallelism. This approach interleaves computation and communication across requests. Our proposed runtime system harnesses GPU scheduling techniques to overlap communication and computation kernels, thereby enabling this new form of parallelism for distributed large-model inference. Extensive evaluations show that our proposal outperforms existing parallelism approaches across models and devices, delivering the best latency and throughput in most cases.
Workshop
Programming Frameworks and System Software
W
DescriptionDuring software development, many aspects of the system and user state can change. Significant time can be spent tracking down the causes of these differences, rather than focusing on the main task of software development. This paper describes a tool to record the state at build-time and at runtime of an application to more easily investigate the cause(s) of differences in behavior. The added logging enables better software quality assurance by tracking code changes and their effects on runtime behavior. At a minimum, this tool only requires prepending one command at build-time and another at runtime. Project-level configurations can be set to enable the collection of additional information.
Workshop
Artificial Intelligence/Machine Learning
Distributed Computing
W
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionSimulations of Lattice Quantum Chromodynamics (LQCD) are an important application (consuming a double-digit percentage of cycles) on major High Performance Computing (HPC) installations, including systems at and near the top of the TOP500 list. In the rapidly changing hardware landscape of HPC, dedicating workforce to optimizing simulation software for every architecture becomes a sustainability issue.
In this work, we explore the feasibility of using performance-portable parallel code for an important LQCD kernel. Combining the Kokkos C++ Performance Portability EcoSystem with MPI allows us to scale on massively parallel machines while still targeting a multitude of different architectures with the same simple code. We report benchmarking results for a range of currently deployed and recently introduced systems, including AMD EPYC 7742, AMD MI250, Fujitsu A64FX, NVIDIA A100, and NVIDIA H100 components, with mostly encouraging results.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionTransformer models suffer from high computational complexity. The Habana GAUDI architecture offers a promising solution to this problem. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. First, we provide a performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Second, we explore strategies to optimize MME and TPC utilization, offering practical insights to enhance computational efficiency. Third, we evaluate the performance of Transformers on GAUDI, particularly in handling long sequences, and uncover performance bottlenecks. Last, we evaluate the end-to-end performance of two Transformer-based large language models (LLMs) on GAUDI. The contributions of this work encompass practical insights for practitioners and researchers alike. We delve into GAUDI's capabilities for Transformers through systematic profiling, analysis, and optimization exploration.
Workshop
Education
State of the Practice
W
Tutorial
Cloud Computing
Software Engineering
TUT
DescriptionHigh Performance Computing in the cloud has grown significantly over the last five years. Weather, computational fluid dynamics (CFD), genomic analysis, and other workloads leverage the elasticity and broad compute choices of the cloud to innovate faster and deliver results sooner. The large choice of compute, storage, and network options and the dynamic nature of the cloud can make the first experience a daunting proposition. Cloud technologies also provide new capabilities to scientists, engineers, and HPC specialists; however, how to use them may not be immediately clear.
This tutorial provides intermediate and advanced content on running and managing HPC in the cloud. It is organized as four series of progressive lectures and labs that provide a hands-on learning experience. It starts with a primer on cloud foundations and how they map to common HPC concepts, dives deeper into core cloud components, and presents best practices for running HPC in the cloud.
This tutorial uses a combination of lectures and hands-on labs on provided temporary Amazon Web Services (AWS) accounts to provide both conceptual and hands-on learning.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionMachine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets to develop and evaluate models. Common practice is to assign these subsets randomly. Although this approach is fast, it only measures a model's capacity to interpolate, and the resulting testing errors may be overly optimistic for out-of-scope data; thus, there is a growing need to easily measure performance on extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. This poster focuses on use cases within cheminformatics. However, astartes operates on arbitrary vectors, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
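To illustrate the idea of more challenging, extrapolative splits (without using astartes' own API), the snippet below holds out an entire KMeans cluster as the test set so the model must extrapolate; the data are synthetic.

```python
# Generic, illustrative cluster-based extrapolative split using scikit-learn
# KMeans: a whole cluster is held out for testing so the model must
# extrapolate; this sketch is not the astartes package's own API.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                     # e.g., molecular descriptors
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

held_out_cluster = 0                              # hold out one whole cluster
test_idx = np.where(labels == held_out_cluster)[0]
train_idx = np.where(labels != held_out_cluster)[0]
print(len(train_idx), "train /", len(test_idx), "test (out-of-cluster)")
```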
Tutorial
Applications
Software Engineering
TUT
DescriptionProducing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality. Code coupling requires aggregate team interactions including integration of software processes and practices. These challenges demand large investments in scientific software development and improved practices. Focusing on improved developer productivity and software sustainability is both urgent and essential.
Attendees will learn about practices, processes, and tools to improve the productivity of those who develop CSE software, increase the sustainability of software artifacts, and enhance trustworthiness in their use. We will focus on aspects of scientific software development that are not adequately addressed by resources developed for industrial software engineering. Topics include the design, refactoring, and testing of complex scientific software systems; collaborative software development; and software packaging. The second half of this full-day tutorial will focus on reproducibility, and why and how to keep a lab notebook for computationally-based research.
Workshop
Quantum Computing
Software Engineering
W
DescriptionThe classical simulation of quantum computers is in general a computationally hard problem. To emulate the behavior of realistic devices, it is sufficient to sample bitstrings from circuits. Recently, Ref. [5] introduced the so-called gate-by-gate sampling algorithm to sample bitstrings and showed it to be computationally favorable in many cases. Here we present bgls, a Python package which implements this sampling algorithm. bgls has native support for several states and is highly flexible for use with additional states. We show how to install and use bgls, discuss optimizations in the algorithm, and demonstrate its utility on several problems.
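For context, the snippet below shows the naive alternative that gate-by-gate sampling improves upon: sampling bitstrings directly from a full final state vector with NumPy; it does not use the bgls API.

```python
# Small, generic NumPy example of sampling bitstrings from a final state
# vector (the naive alternative to gate-by-gate sampling); it does not call
# the bgls API.
import numpy as np

def sample_bitstrings(state, shots, seed=0):
    """Sample measurement outcomes from an n-qubit state vector."""
    n_qubits = int(np.log2(state.size))
    probs = np.abs(state) ** 2
    rng = np.random.default_rng(seed)
    outcomes = rng.choice(state.size, size=shots, p=probs)
    return [format(int(o), f"0{n_qubits}b") for o in outcomes]

# |Phi+> Bell state: only '00' and '11' should appear.
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
print(sample_bitstrings(bell, shots=8))
```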
ACM Gordon Bell Finalist
Awards
TP
DescriptionReal-time, 30-second-refresh numerical weather prediction (NWP) was performed with exclusive use of 11,580 nodes (~7%) of the supercomputer Fugaku during the Tokyo Olympics and Paralympics in 2021. A total of 75,248 forecasts were disseminated over the one-month period, mostly stably, with a time-to-solution of less than 3 minutes for each 30-minute forecast. Japan's Big Data Assimilation (BDA) project developed this novel NWP system for precise prediction of hazardous rains, contributing to addressing the global climate crisis. Compared with typical 1-hour-refresh systems, the BDA system offered a two-orders-of-magnitude increase in problem size and revealed the effectiveness of 30-second refresh for highly nonlinear, rapidly evolving convective rains. To achieve the required time-to-solution for real-time 30-second refresh with high accuracy, the core BDA software incorporated single precision and enhanced parallel I/O with a properly selected configuration of 1,000 ensemble members and a 500-m-mesh weather model. The massively parallel, I/O-intensive real-time BDA computation demonstrated a promising future direction.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionDynamic graph networks are widely used for learning time-evolving graphs, but prior approaches to training these networks are inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced by carefully scheduling the workload, and that the execution order of operators in GNNs can be adjusted without hurting training convergence.
We propose a system called BLAD to consider the above factors, comprising a two-level load scheduler and an overlap-aware topology manager. The scheduler allocates each snapshot group to a GPU, alleviating cross-GPU communication.
The snapshots in a group are then carefully allocated to processes on a GPU, enabling overlap of compute-intensive NN operators and memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize this overlap. Experiments show that BLAD achieves a 27.2% speedup in training time on average, without affecting final accuracy, compared to state-of-the-art solutions.
Invited Talk
Applications
Biology
Medicine
TP
DescriptionNeuroscience has become a highly interdisciplinary research field, including among others purely experimental studies, applied technology development, mathematical theory, computational models and simulations, AI, visualization, and data analysis. However, neuroscience is relatively new to the use of High Performance Computing. Within the European flagship Human Brain Project, scientists from all around Europe have made substantial progress in consolidating the computational requirements and usage patterns of this heterogeneous field. In parallel with the evolution of the European HPC landscape, neuroscience has also helped co-design federated access to HPC, cloud, and data resources through the ICEI project, in collaboration with FENIX-RI, a European effort to provide federated access to some of the largest HPC centers in Europe.
In this talk, I will provide a general overview of the evolving relationships between neuroscience and HPC. I will also present some examples of scientific highlights which have been made possible by this interaction. Finally, I will provide a perspective of how neuroscience can contribute to future technology co-design keeping in focus societal impact. I will complement this talk with my personal story and international experiences.
Paper
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
TP
DescriptionMosaic Flow is a novel domain decomposition method designed to scale physics-informed neural PDE solvers to large domains. Its unique approach leverages pre-trained networks on small domains to solve partial differential equations on large domains purely through inference, resulting in high reusability. This paper presents an end-to-end parallelization of Mosaic Flow, combining data parallel training and domain parallelism for inference on large-scale problems. By optimizing the network architecture and data parallel training, we significantly reduce the training time for learning the Laplacian operator to minutes on 32 GPUs. Moreover, our distributed domain decomposition algorithm enables scalable inferences for solving the Laplace equation on domains 4096x larger than the training domain, demonstrating strong scaling while maintaining accuracy on 32 GPUs. The reusability of Mosaic Flow, combined with the improved performance achieved through the distributed-memory algorithms, makes it a promising tool for modeling complex physical phenomena and accelerating scientific discovery.
Workshop
Education
State of the Practice
W
DescriptionThe convergence of quantum technologies and high-performance computing offers unique opportunities for research and algorithm development, demanding a skilled workforce to harness the quantum systems' potential. In this lightning talk, we address the growing need to train experts in quantum computing and explore the challenges in training these individuals in quantum computing, including the abstract nature of quantum theory, or the focus on specific frameworks. To overcome these obstacles, we propose self-guided learning resources that offer interactive learning experiences and practical framework-independent experimentation for different target audiences.
Paper
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionMany real-world computations involve sparse data structures in the form of sparse matrices. A common strategy for optimizing sparse matrix operations is to reorder a matrix to improve data locality. However, it is not always clear whether reordering will provide benefits over the original ordering, as its effectiveness depends on several factors, such as the structural features of the matrix, the reordering algorithm, and the hardware used. This paper aims to establish the relationship between matrix reordering algorithms and the performance of sparse matrix operations. We thoroughly evaluate six different matrix reordering algorithms on 490 matrices across eight multicore architectures, focusing on the commonly used sparse matrix-vector multiplication (SpMV) kernel. We find that reordering based on graph partitioning provides better SpMV performance than the alternatives for a large majority of matrices, and that the resulting performance is explained by a combination of data locality and load balancing concerns.
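As a small illustration of the kind of comparison the study performs, the snippet below reorders a synthetic sparse matrix with one classic baseline (reverse Cuthill-McKee via SciPy) and times SpMV before and after; the matrix and timings are illustrative, not the paper's 490-matrix suite or its six reordering algorithms.

```python
# Illustrative SciPy sketch of reordering a sparse matrix with reverse
# Cuthill-McKee (one classic baseline) and comparing SpMV time before and
# after; the random matrix and timings are synthetic, not the paper's suite.
import time
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(20000, 20000, density=1e-4, format="csr", random_state=0)
A = (A + A.T).tocsr()                       # symmetrize so RCM is well-defined
x = np.ones(A.shape[0])

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm].tocsr()            # apply the permutation to rows/cols

for name, M in [("original", A), ("RCM-reordered", A_rcm)]:
    t0 = time.perf_counter()
    for _ in range(50):
        M @ x
    print(f"{name}: {time.perf_counter() - t0:.3f} s for 50 SpMVs")
```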
Invited Talk
Education
HPC in Society
TP
DescriptionAchievements in high-performance computing (HPC) ─ including computational and data-enabled science, analytics, learning, and artificial intelligence (AI) ─ drive progress in science and technology throughout our world. For example, collaborators in the U.S. Department of Energy (DOE) Exascale Computing Project (ECP) are pushing advances across a compelling range of scientific and engineering disciplines by pioneering a robust ecosystem of software technologies that exploit cutting-edge exascale computer architectures.
In order for the HPC community to address the most urgent scientific and societal challenges of the 21st century, the HPC workforce must embody a wide range of skills and perspectives … fully reflecting the diversity of society, including traditionally underrepresented communities — Black or African American, Hispanic/Latinx, Native American, Alaska Native, Native Hawaiian, Pacific Islanders, women, persons with disabilities, and first-generation scholars.
Each of us can make important contributions to broadening participation in HPC. This presentation will provide an overview of a variety of workforce efforts throughout the HPC community and opportunities for involvement. We will discuss the contributions of DOE lab staff who are working as part of the ECP Broadening Participation Initiative to address DOE workforce challenges through a lens that considers the distinct needs and culture of high-performance computing. Activities focus on three complementary thrusts: (1) Establishing an HPC Workforce Development and Retention Action Group to foster a supportive and inclusive culture in DOE labs and communities; (2) expanding the Sustainable Research Pathways (SRP) internship and workforce development program as a multi-lab cohort of students from underrepresented groups (and faculty working with them), who collaborate with DOE lab staff on world-class R&D projects; and (3) creating the Intro to HPC Bootcamp, an immersive program designed to engage students in energy justice using project-based pedagogy and real-life science stories to teach foundational skills in HPC, scalable AI, and analytics while exposing students to the excitement of DOE mission-driven team science. The presentation will highlight the first bootcamp (a collaboration among staff from advanced computing facilities at Argonne, Lawrence Berkeley, and Oak Ridge National Labs, Sustainable Horizons Institute, the DOE Office of Economic Impact and Diversity, and academic partners), which took place in August 2023 and featured a variety of HPC energy justice projects inspired by the DOE Justice40 Initiative. We will also consider challenges and opportunities for future work to broaden participation in HPC.
Exhibits
Flash Session
TP
XO/EX
DescriptionJoin speakers from NVIDIA and Arc Compute as they discuss solutions to the everyday challenges organizations face when building AI infrastructure and learn how Arc Compute's turnkey, end-to-end AI solutions, powered by NVIDIA GPUs and networking, are game changers helping decision-makers design, procure, and deploy their AI infrastructure.
Workshop
Data Movement and Memory
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionWe propose a new framework called CachedArrays and a set of APIs to address the data tiering problem in large-scale heterogeneous and disaggregated memory systems. The proposed framework operates at a variable-size object granularity and allows the programmer to specify semantic hints about the future use of data via a Policy API. These hints are used by a Data Manager to choose when and where to place a particular data object using a data management API, bridging the semantic gap between the programmer and the platform-specific hardware details and optimizing overall performance. We evaluate the proposed framework on a real hardware platform with terabytes of memory consisting of NVRAM and DRAM, on large-scale ML training workloads such as CNNs, DNNs, and DLRM that exhibit different data access and usage patterns.
Paper
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionThis paper presents a parameterized analytical performance model of transformer-based Large Language Models (LLMs) for guiding high-level algorithm-architecture codesign studies. This model derives from an extensive survey of performance optimizations that have been proposed for the training and inference of LLMs; the model's parameters capture application characteristics, the hardware system, and the space of implementation strategies. With such a model, we can systematically explore a joint space of hardware and software configurations to identify optimal system designs under given constraints, like the total amount of system memory. We implemented this model and methodology in a Python-based open-source tool called Calculon. Using it, we identified novel system designs that look significantly different from current inference and training systems, showing quantitatively the estimated potential to achieve higher efficiency, lower cost, and better scalability.
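To give a feel for what such an analytical model computes, the following back-of-the-envelope sketch estimates a roofline-style lower bound on the time of one transformer layer from FLOPs and weight traffic; the constants are illustrative assumptions and not Calculon's actual equations.

```python
# Back-of-the-envelope, hypothetical sketch of an analytical estimate for one
# transformer layer: time is bounded below by both compute and memory traffic.
# The constants are illustrative and are not Calculon's actual equations.
def layer_time_s(d_model, seq_len, batch, peak_flops, mem_bw_bytes, bytes_per_elem=2):
    # Rough FLOP count for the big GEMMs: attention (4*d^2) + MLP (8*d^2), forward only.
    flops = 2 * batch * seq_len * (4 * d_model ** 2 + 8 * d_model ** 2)
    # Rough weight traffic: the same 12*d^2 parameters read once from memory.
    weight_bytes = 12 * d_model ** 2 * bytes_per_elem
    t_compute = flops / peak_flops
    t_memory = weight_bytes / mem_bw_bytes
    return max(t_compute, t_memory)            # roofline-style lower bound

# Example: d=8192, seq=2048, batch=1 on a ~1 PFLOP/s, ~3 TB/s accelerator.
print(f"{layer_time_s(8192, 2048, 1, 1e15, 3e12) * 1e3:.2f} ms per layer")
```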
Workshop
W
DescriptionThe ongoing revolution enabled via containerization, virtualization, and new orchestration models has dramatically changed how applications and services are delivered and managed across the computing industry. This revolution has established a new ecosystem of tools and techniques with new, flexible and agile approaches, and continues to gain traction in the HPC community. In addition to HPC-optimized container runtimes, emerging technologies like Kubernetes create a new set of opportunities and challenges. While adoption is growing, questions regarding best practices, foundational concepts, tools, and standards remain. Our goal is to promote the adoption of these tools and introspect the impact of this new ecosystem on HPC use cases. This workshop serves as a key venue for presenting late-breaking research, sharing experiences and best practices, and fostering collaboration in this field. Our fifth workshop iteration will continue to emphasize real-world experiences and challenges in adopting and optimizing these new approaches for HPC.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionExtending Linux through kernel modules offers immense potential benefits and capabilities for HPC. Deployment is also more likely since Linux is typically the only supported vendor OS. However, because Linux is monolithic, kernel modules are free to access any address with maximum permissions. A poorly written, or untrustworthy, module can wreak havoc. This makes it hard to justify including custom kernel modules in production HPC systems. We address this limitation using the previously developed compiler- and runtime-based address translation (CARAT) model and toolchain, which injects guards around memory accesses. The accesses are then allowed/disallowed according to a policy. We share our results regarding the guard injection and address validation process. Our CARAT-based Kernel Object Protection (CARAT KOP) prototype is able to transform a substantial production kernel module from the kernel tree (a NIC driver comprising ~19,000 lines of code). The transformed module runs with minimal effect on its performance.
Panel
Energy Efficiency
Green Computing
Sustainability
TP
DescriptionWhat does it mean for computer systems to be sustainable? We have made significant improvements to operational efficiency in HPC systems. We now need to consider a broader scope of environmental impacts across the life cycle of our systems: how they are designed and manufactured, how they are transported, how they are operated, and how we tear them down, re-use, and recycle them after they are no longer useful. These considerations may not be obvious. For example, manufacturing costs dominate the life cycle carbon footprint of systems, and that trend is on the rise. How can we start to consider the carbon footprint across the end-to-end life cycle of our systems? We have many capabilities for understanding the performance, power, and energy of our systems, but the same cannot be said for carbon footprint. Should carbon footprint be a first-order optimization target?
Early Career Program
Inclusivity
TP
DescriptionFinding the right career path early may be one of the most rewarding discoveries in a young professional's life. This panel discussion will feature insightful stories and kernels of wisdom from four panelists whose diverse careers span start-ups to large companies, non-profit organizations to universities, and government labs to government agencies. They offer their practical wisdom to present a broader picture of the different workplaces in the HPC community, helping young individuals better match their strengths and objectives to the challenges and rewards of each workplace.
Workshop
Programming Frameworks and System Software
W
DescriptionWe present a new methodology and tool that speeds up the process of optimizing science and engineering programs. The tool, called CaRV (Capture, Replay, and Validate), enables users to experiment quickly with large applications, comparing individual program sections before and after optimizations in terms of efficiency and accuracy. Using language-level checkpointing techniques, CaRV captures the necessary data for replaying the experimental section as a separate execution unit after the code optimization and validating the optimization against the original program. The tool reduces the amount of time and resources spent on experimentation with long-running programs by up to two orders of magnitude, making program optimization more efficient and cost-effective.