Presentation
Dynamic Memory Provisioning on Disaggregated HPC Systems
Description
Disaggregated memory aims to break the rigid boundaries between node memory hierarchies by providing memory as a pooled resource. The resource manager allocates the system's memory at job submission time. However, it is hard for users to know a job's precise peak memory footprint, and prior work has shown that users have an incentive to overestimate it. This leads to significant overallocation, and most of the physical memory in the system is wasted. We present a way to reclaim much of this overallocated memory. We extend the Slurm job scheduler to dynamically reallocate memory according to each job's current memory footprint. We enhance an existing Slurm simulator to model this scenario and combine publicly available traces to model an HPC system of up to 1490 nodes. We show that the dynamic memory provisioning approach increases throughput per dollar by up to 38% compared to a system with static allocation of disaggregated memory.
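To make the contrast between static and dynamic provisioning concrete, the following is a minimal toy sketch, not the Slurm extension or the simulator described in the abstract. The `Job` class, the `headroom_gb` safety margin, and all numbers are hypothetical and chosen only for illustration. It compares the memory a pool must reserve when each job holds its full request against the memory reserved when each allocation tracks the job's current footprint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative assumptions."""
    name: str
    requested_gb: int      # user's (typically overestimated) request at submission
    usage_gb: List[int]    # assumed memory footprint sampled per time interval


def static_reserved(jobs: List[Job]) -> int:
    """Static allocation: every job holds its full request for its lifetime."""
    return sum(j.requested_gb for j in jobs)


def dynamic_reserved(jobs: List[Job], t: int, headroom_gb: int = 0) -> int:
    """Dynamic allocation: each job holds its current footprint plus an
    assumed safety margin, never more than it originally requested."""
    total = 0
    for j in jobs:
        # Clamp t so a job with fewer samples reuses its last footprint.
        current = j.usage_gb[min(t, len(j.usage_gb) - 1)]
        total += min(j.requested_gb, current + headroom_gb)
    return total


jobs = [
    Job("a", requested_gb=64, usage_gb=[10, 20, 48]),
    Job("b", requested_gb=32, usage_gb=[8, 8, 12]),
]

for t in range(3):
    reclaimed = static_reserved(jobs) - dynamic_reserved(jobs, t, headroom_gb=4)
    print(f"interval {t}: reclaimable memory = {reclaimed} GB")
```

In this toy setting the reclaimable memory shrinks as the jobs approach their peak footprints, which is why a dynamic scheme has to re-evaluate allocations over the lifetime of a job rather than once at submission.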