BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20240116T191658Z
LOCATION:DEF Concourse
DTSTART;TZID=America/Denver:20231114T100000
DTEND;TZID=America/Denver:20231114T170000
UID:submissions.supercomputing.org_SC23_sess291_rpost196@linklings.com
SUMMARY:Balancing Latency and Throughput of Distributed Inference by Inter
 leaved Parallelism
DESCRIPTION:Posters, Research Posters\n\nJiangsu Du, Jinhui Wei, and Jiazh
 i Jiang (Sun Yat-sen University, Guangzhou); Shenggan Cheng (National Univ
 ersity of Singapore); and Zhiguang Chen, Dan Huang, and Yutong Lu (Sun Yat
 -sen University, Guangzhou)\n\nDistributed large model inference still fa
 ces a dilemma in balancing latency and throughput, or rather cost and eff
 ectiveness. Tensor parallelism can optimize latency but entails substanti
 al expense. Conversely, pipeline parallelism excels in throughput but fal
 ls short in minimizing execution time.\n\nTo address this challenge, we i
 ntroduce a novel solution: interleaved parallelism. This approach interle
 aves computation and communication across requests. Our runtime system ha
 rnesses GPU scheduling techniques to overlap communication and computatio
 n kernels, thereby enabling this new parallelism scheme for distributed l
 arge model inference. Extensive evaluations show that our proposal outper
 forms existing parallelism approaches across models and devices, deliveri
 ng the best latency and throughput in most cases.\n\nRegistration Categor
 y: Tech Program Reg Pass, Exhibits Reg Pass
END:VEVENT
END:VCALENDAR
