Presentation
Optimizing MPI Collectives on Shared Memory Multi-Cores
Session: Message Passing Innovations
Description: Collective communication operations, such as broadcasts and reductions, are often performance bottlenecks in Message Passing Interface (MPI) programs. As the number of processor cores integrated into CPUs grows, it is increasingly common to run multiple MPI processes on shared-memory machines to exploit hardware parallelism. In this context, optimizing MPI collective communication for shared-memory execution is crucial. This paper identifies two primary limitations of existing MPI collective implementations on shared-memory systems: extensive redundant data movement when performing reduction collectives, and ineffective use of non-temporal instructions when processing streamed data. To address these limitations, we propose two optimization techniques designed to minimize data movement and make better use of non-temporal instructions. We integrate our optimizations into Open MPI and evaluate them with micro-benchmarks and real-world applications on two multi-core clusters. Experiments show that our approach outperforms existing techniques by 1.2-6.4x.
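The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea behind non-temporal stores in a shared-memory reduction step: when the result buffer will not be re-read soon by the writing process, streaming stores avoid polluting its cache. The function name, buffer names, and AVX width are assumptions for illustration, not the authors' code.

```c
/* Sketch only: elementwise sum of two operand buffers into a destination,
 * writing the result with non-temporal (streaming) stores so the output
 * bypasses the writer's cache. Compile with AVX support (e.g. -mavx). */
#include <immintrin.h>
#include <stddef.h>

/* dst, a, b must be 32-byte aligned; n must be a multiple of 8 floats. */
static void reduce_sum_nontemporal(float *dst, const float *a,
                                   const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   /* cached loads of operands */
        __m256 vb = _mm256_load_ps(b + i);
        __m256 vs = _mm256_add_ps(va, vb);   /* elementwise reduction (sum) */
        _mm256_stream_ps(dst + i, vs);       /* non-temporal store: bypass cache */
    }
    _mm_sfence();  /* order streaming stores before other processes read dst */
}
```

Whether such streaming stores help depends on whether the destination is re-read by the same core soon afterwards, which is presumably part of what the paper's heuristics for "effective use of non-temporal instructions" decide.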