BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240116T185912Z
LOCATION:E Concourse
DTSTART;TZID=America/Denver:20231114T100000
DTEND;TZID=America/Denver:20231114T170000
UID:submissions.supercomputing.org_SC23_sess290_drs103@linklings.com
SUMMARY:Overcoming the Gap between Compute and Memory Bandwidth in Modern 
 GPUs
DESCRIPTION:Doctoral Showcase, Posters\n\nLingqi Zhang (Tokyo Institute of
  Technology)\n\nThe imbalance between compute and memory bandwidth has bee
 n a long-standing issue. Despite efforts to address it, the gap between th
 em is still widening. This has led to the categorization of many applicati
 ons as memory-bound kernels.\n\nThis dissertation centers on memory-bound
  kernels, with a particular emphasis on Graphics Processing Units (GPUs), 
 given their rising prevalence in High-Performance Computing (HPC) systems.
 \n\nIn this dissertation, we first examine trends in GPU development over
  recent decades. Examples include cooperative groups (i.e., device-wide
  barriers), asynchronous copies to shared memory (i.e., hardware
  prefetching), lower latency of operations, and a larger volume of
  on-chip resources (register files and L1 cache).\n\nThis dissertation
  seeks to use the latest GPU features to optimize memory-bound kernels.
  Specifically, we propose extending the kernel's lifetime across time
  steps and taking advantage of the large volume of on-chip resources
  (i.e., register files and scratchpad memory) to reduce or eliminate
  traffic to device memory. Furthermore, we champion a minimum level of
  parallelism to maximize the available on-chip resources.\n\nBased on
  these strategies, we propose a general execution model for running
  memory-bound iterative GPU kernels: PERsistent KernelS (PERKS) and a
  novel temporal blocking method, EBISU. Evaluations show outstanding
  performance on the latest GPU architectures compared with
  state-of-the-art counterparts.\n\nTag: Accelerators\n\n
 Registration Category: Tech Program Reg Pass, Exhibits
  Reg Pass
END:VEVENT
END:VCALENDAR