Presentation


Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning

Session

5th Workshop on Programming and Performance Visualization Tools (ProTools 2023)

Description

With the rapidly increasing size and complexity of DNNs, equally sophisticated methods are needed to train them efficiently, including distributed training and various model/hybrid parallelism approaches. Even though developers heavily rely on state-of-the-art frameworks such as PyTorch and TensorFlow, these provide little insight into an application's training behavior at scale, leading to latent performance bottlenecks and inefficient training configurations. We propose Extra-Deep, an automated empirical performance modeling approach for distributed deep learning. We leverage the created models to analyze a training task's performance, scalability, efficiency, and cost. Using an efficient sampling strategy that reduces the profiling time for the required empirical measurements by, on average, about 94.9%, we can identify cost-effective training configurations even for large-scale applications. We evaluated our approach on three parallelization strategies, with four DNN models and five datasets. The results show that Extra-Deep has an average prediction accuracy of 93.6% when compared to empirical results.

Author/Presenters

Marcus Ritter

Technical University of Darmstadt

Felix Wolf

Technical University of Darmstadt

Event Type

Workshop

Time

Sunday, 12 November 2023, 12:06pm - 12:30pm MST

Location

710