BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240116T191703Z
LOCATION:403-404
DTSTART;TZID=America/Denver:20231116T103000
DTEND;TZID=America/Denver:20231116T110000
UID:submissions.supercomputing.org_SC23_sess178_pap582@linklings.com
SUMMARY:Optimizing Direct Convolutions on ARM Multi-Cores
DESCRIPTION:Paper\n\nPengyu Wang\, Weiling Yang\, Jianbin Fang\, Dezun
  Dong\, Chun Huang\, Peng Zhang\, and Tao Tang (National University of
  Defense Technology (NUDT)\, China) and Zheng Wang (University of Leeds\,
  School of Computing\, UK)\n\nConvolution kernels are widely seen in deep
  learning workloads and are often responsible for performance
  bottlenecks. Recent research has demonstrated that a direct convolution
  approach can outperform the traditional convolution implementation based
  on tensor-to-matrix conversions. However\, existing approaches for
  direct convolution still have room for performance improvement. We
  present NDIRECT\, a new direct convolution approach that targets
  ARM-based multi-core CPUs commonly found in smartphones and HPC systems.
  NDIRECT is designed to be compatible with the data layout formats used
  by mainstream deep learning frameworks but offers new optimizations for
  the computational kernel\, data packing\, and parallelization. We
  evaluate NDIRECT by applying it to representative convolution kernels
  and demonstrating its performance on four distinct ARM multi-core CPU
  platforms. We compare NDIRECT against state-of-the-art convolution
  optimization techniques. Experimental results show that NDIRECT gives
  the best overall performance across evaluation scenarios and
  platforms.\n\nTag: Artificial Intelligence/Machine Learning\, Codesign\,
  Performance Optimization\, Programming Frameworks and System
  Software\n\nRegistration Category: Tech Program Reg
  Pass\n\nReproducibility Badges: Artifact Available\n\nSession Chair:
  Aparna Chandramowlishwaran (University of California\, Irvine)
END:VEVENT
END:VCALENDAR
