Presentation
Chameleon: A Disaggregated CPU, GPU, and FPGA System for Retrieval-Augmented Language Models
Session: Ninth International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC 2023)
Description: A Retrieval-Augmented Language Model (RALM) augments a generative language model by retrieving context-specific knowledge from an external database via vector search. This strategy enables impressive text generation quality even with smaller models, saving orders of magnitude of computational resources compared to large language models such as GPT-4. However, RALMs introduce significant system-design challenges due to the diverse workload characteristics of their components. In this presentation, we present Chameleon, a heterogeneous system that combines CPUs, GPUs, and FPGAs in a disaggregated manner for efficient RALM serving. While GPUs still handle the computationally intensive model inference, we design a distributed CPU-FPGA engine for large-scale vector search, which demands substantial memory capacity and rapid quantized-vector decoding: the CPU server manages the vector index, while FPGA-based disaggregated memory nodes scan database vectors using near-memory accelerators. Chameleon's vector search achieves 8.6–29.4x lower latency than CPU-based systems and 1.6–57.9x lower latency than GPU-based systems.
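The division of labor described above — a CPU-resident index that narrows the search, followed by scans of only the selected database partitions on the memory nodes — can be illustrated with a minimal inverted-file (IVF) style sketch. This is an assumption-laden toy in NumPy, not Chameleon's implementation: partitions stand in for FPGA memory nodes, all names and parameters (`DIM`, `N_PARTITIONS`, `nprobe`, etc.) are illustrative, and quantized-vector decoding is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_VECTORS, N_PARTITIONS = 64, 10_000, 32

# Synthetic database vectors and coarse-index centroids.
db = rng.standard_normal((N_VECTORS, DIM)).astype(np.float32)
centroids = rng.standard_normal((N_PARTITIONS, DIM)).astype(np.float32)

# Index build (CPU side): assign each vector to its nearest centroid,
# so each partition models the shard held by one memory node.
assign = np.argmin(((db[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
partitions = [np.nonzero(assign == p)[0] for p in range(N_PARTITIONS)]

def search(query, nprobe=4, k=5):
    # Stage 1 (CPU index): pick the nprobe partitions whose centroids
    # are closest to the query.
    d_c = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(d_c)[:nprobe]
    # Stage 2 (memory nodes): scan only the probed partitions' vectors
    # and return the k nearest database IDs.
    cand = np.concatenate([partitions[p] for p in probe])
    d = ((db[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]

top_ids = search(rng.standard_normal(DIM).astype(np.float32))
```

The point of the two stages is that the coarse index keeps the scan cost proportional to `nprobe / N_PARTITIONS` of the database, which is what makes offloading the partition scans to near-memory accelerators worthwhile.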