
Deploying large language model (LLM) inference has become a core challenge in artificial intelligence. Current mainstream inference frameworks are optimized primarily for datacenter-grade hardware and offer comparatively weak support for consumer-grade multi-GPU systems. This paper systematically surveys key techniques in LLM inference optimization, including FlashAttention, PagedAttention, KV cache management, and multi-GPU parallelism strategies, and analyzes in depth the technical characteristics and limitations of existing inference frameworks. Building on this foundation, we propose and implement Ember, a lightweight CUDA inference engine optimized specifically for consumer-grade multi-GPU systems. Ember combines a pipeline-parallelism strategy with a Chunked Prefill with Overlap technique to reduce exposed PCIe communication. Experimental results show that on a dual NVIDIA RTX 3080 Ti configuration, Ember achieves a 1.16× dual-GPU speedup, whereas llama.cpp's Layer Split strategy achieves only a 1.01× speedup, validating the effectiveness of the proposed approach. Compared with ExLlamaV3's Tensor Parallel strategy, Ember's Pipeline Parallel approach achieves comparable scaling efficiency while keeping the growth in time-to-first-token latency lower. This research provides practical guidance for LLM inference on consumer-grade hardware and demonstrates the potential for commercial application in this field.
Large Language Model, Multi-GPU, Pipeline Parallelism, LLM Inference, CUDA
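As an illustration of why overlapping chunk transfers with compute can reduce exposed PCIe time in a two-GPU pipeline, the following is a minimal timing sketch. It is a toy model under stated assumptions, not Ember's implementation: the function name `prefill_makespan`, the equal per-chunk compute time on both GPUs, and all numeric values are hypothetical and unrelated to the measurements reported above.

```python
def prefill_makespan(num_chunks, compute_per_chunk, transfer_per_chunk, overlap):
    """Toy timeline for a two-stage (two-GPU) pipeline during prefill.

    Assumptions (hypothetical, for illustration only):
      - each chunk costs `compute_per_chunk` on GPU 0 and again on GPU 1;
      - moving a chunk's activations over PCIe costs `transfer_per_chunk`;
      - with `overlap=True` the copy runs concurrently with GPU 0's next
        chunk; with `overlap=False` GPU 0 stalls until the copy finishes.
    Returns the time at which GPU 1 finishes the last chunk.
    """
    gpu0_free = 0.0  # earliest time GPU 0 can start its next chunk
    link_free = 0.0  # earliest time the PCIe link can start a new copy
    gpu1_free = 0.0  # earliest time GPU 1 can start its next chunk

    for _ in range(num_chunks):
        done0 = gpu0_free + compute_per_chunk
        # The copy starts once the chunk is computed and the link is idle.
        xfer_done = max(done0, link_free) + transfer_per_chunk
        link_free = xfer_done
        # Overlap hides the copy behind GPU 0's next chunk of compute.
        gpu0_free = done0 if overlap else xfer_done
        # GPU 1 needs both the data and a free compute slot.
        gpu1_free = max(xfer_done, gpu1_free) + compute_per_chunk

    return gpu1_free


if __name__ == "__main__":
    # Same total work in all three cases: 40 units of compute per GPU, 8 of transfer.
    print(prefill_makespan(1, 40.0, 8.0, overlap=True))   # no chunking
    print(prefill_makespan(4, 10.0, 2.0, overlap=False))  # chunked, blocking copies
    print(prefill_makespan(4, 10.0, 2.0, overlap=True))   # chunked, overlapped copies
```

In this model the unchunked prefill takes 88 time units, chunking with blocking copies takes 58, and chunking with overlapped copies takes 52. Under these assumptions only the final chunk's transfer stays on the critical path once copies are overlapped.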
