A Reinforcement Learning-Based Backfilling Strategy for HPC Batch Jobs
Description: HPC systems employ a scheduling technique called “backfilling”, wherein low-priority jobs are scheduled earlier to use available resources that are being held idle for pending high-priority jobs. Backfilling relies on job runtime estimates to calculate the start time of the ready-to-schedule jobs and avoid delaying them. It is commonly believed that better job runtime estimates lead to better backfilling and more effective scheduling. However, our experiments point to a different conclusion: there is a previously overlooked trade-off between prediction accuracy and backfilling opportunities. We believe reinforcement learning (RL) can be leveraged to learn how to achieve the best trade-off. Based on this idea, we designed RLBackfilling, a reinforcement learning-based backfilling algorithm. Our evaluation results show up to 17x better scheduling performance compared to EASY backfilling using user-provided job runtime, and 4.7x better performance compared to EASY using the ideal predicted job runtime (the actual job runtime).
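To make the role of the runtime estimate concrete, the following is a minimal sketch of the EASY backfilling baseline as it is commonly described, not of RLBackfilling itself, whose internals are not given here. The names `Job`, `reservation`, and `easy_backfill` are hypothetical, and the sketch assumes a single resource type (nodes) and one scheduling decision at a fixed point in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int
    runtime: float  # runtime estimate used by the scheduler (assumed input)


def reservation(head, free_nodes, running):
    """Return (shadow_time, extra_nodes) for a head job that cannot start yet.

    running is a list of (end_time, nodes) pairs for jobs already executing.
    shadow_time is the earliest time enough nodes are free for the head job;
    extra_nodes are the nodes left over at that time after the head job starts.
    """
    available = free_nodes
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= head.nodes:
            return end_time, available - head.nodes
    raise ValueError("head job does not fit even when all running jobs end")


def easy_backfill(queue, free_nodes, running, now=0.0):
    """Names of jobs behind queue[0] that may start now without delaying it."""
    head = queue[0]
    if head.nodes <= free_nodes:
        return []  # the head job can start directly; nothing to backfill around
    shadow_time, extra = reservation(head, free_nodes, running)
    started = []
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue  # does not fit in the currently idle nodes
        if now + job.runtime <= shadow_time:
            pass  # expected to finish before the head job's reserved start
        elif job.nodes <= extra:
            extra -= job.nodes  # runs past the reservation on spare nodes only
        else:
            continue  # would delay the head job
        free_nodes -= job.nodes
        started.append(job.name)
    return started


# Illustrative scenario: 4 idle nodes, 6 nodes busy until t=100,
# and a head job that needs 8 nodes.
running = [(100.0, 6)]
head = Job("head", nodes=8, runtime=50.0)
tight = [head, Job("A", 4, 80.0), Job("B", 2, 200.0)]
loose = [head, Job("A", 4, 150.0), Job("B", 2, 200.0)]  # A's estimate inflated
picked_tight = easy_backfill(tight, free_nodes=4, running=running)
picked_loose = easy_backfill(loose, free_nodes=4, running=running)
```

In this scenario the same job A is backfilled when its estimate is 80 but skipped in favor of B when its estimate is 150, which shows how the estimate alone can change which jobs are backfilled.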