
BLAD: Adaptive Load Balanced Scheduling and Operator Overlap Pipeline for Accelerating the Dynamic GNN Training
Description
Dynamic graph neural networks are widely used for learning on time-evolving graphs, but prior systems for training them are inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced by carefully scheduling the workload, and that the execution order of operators in GNNs can be adjusted without hurting training convergence.

We propose BLAD, a system that exploits these observations; it comprises a two-level load scheduler and an overlap-aware topology manager. The scheduler first allocates each group of snapshots to a GPU, alleviating cross-GPU communication. The snapshots in a group are then carefully assigned to processes on that GPU, enabling overlap between compute-intensive NN operators and memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize this overlap. Experiments show that BLAD achieves a 27.2% average speedup in training time over state-of-the-art solutions without affecting final accuracy.
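As a rough illustration only (this listing does not detail BLAD's actual algorithms), a minimal Python sketch of the two-level scheduling idea might look like the following; the greedy heuristic, the function names, and the even/odd snapshot split are assumptions made for exposition, not the paper's method:

import heapq

def assign_groups_to_gpus(group_costs, num_gpus):
    """Level 1 (assumed greedy heuristic): place each snapshot group on
    the currently least-loaded GPU, largest groups first, so consecutive
    snapshots stay together and cross-GPU communication is avoided."""
    heap = [(0.0, g) for g in range(num_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    placement = {}
    for gid, cost in sorted(group_costs.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        placement[gid] = gpu
        heapq.heappush(heap, (load + cost, gpu))
    return placement

def split_group_across_processes(snapshots):
    """Level 2 (assumed even/odd split): divide a group's snapshots
    between two worker processes on the same GPU, so one process can run
    compute-bound NN operators while the other runs memory-bound graph
    operators, letting the two kinds of work overlap in time."""
    return snapshots[0::2], snapshots[1::2]

if __name__ == "__main__":
    costs = {"g0": 4.0, "g1": 2.5, "g2": 3.0, "g3": 1.5}  # toy per-group costs
    print(assign_groups_to_gpus(costs, num_gpus=2))
    print(split_group_across_processes(["s0", "s1", "s2", "s3", "s4"]))

The sketch only captures the load-balancing structure; BLAD's scheduler and its topology manager's operator reordering presumably use cost models and ordering constraints that the abstract does not describe.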
Event Type
Paper
Time
Tuesday, 14 November 2023, 3:30pm - 4pm MST
Location
301-302-303
Tags
Artificial Intelligence/Machine Learning
Registration Categories
TP