arXiv: 2404.03353
handle: 2117/428295, 2117/409824
Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows the Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
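To make the replication argument concrete, below is a minimal back-of-the-envelope sketch (not taken from the paper) of how many independent model replicas fit on a single accelerator once weights and a per-replica KV-cache budget are accounted for. All figures here, the fp16 weight precision, the 4 GB KV-cache budget, the 80 GB accelerator, and the 0.9 memory-utilization factor, are hypothetical assumptions used only for illustration.

```python
# Back-of-the-envelope estimate of how many SLM replicas fit on one accelerator.
# All numbers below are illustrative assumptions, not measurements from the paper.

def replica_capacity(num_params_b: float,
                     bytes_per_param: int = 2,      # fp16/bf16 weights (assumption)
                     kv_cache_gb: float = 4.0,      # per-replica KV-cache budget (assumption)
                     accelerator_gb: float = 80.0,  # single 80 GB accelerator (assumption)
                     utilization: float = 0.9) -> int:
    """Return how many independent replicas fit in the usable accelerator memory."""
    weights_gb = num_params_b * bytes_per_param      # e.g. 1 B params * 2 bytes ~= 2 GB
    per_replica_gb = weights_gb + kv_cache_gb        # weights plus KV cache per replica
    usable_gb = accelerator_gb * utilization         # leave headroom for activations, etc.
    return int(usable_gb // per_replica_gb)

if __name__ == "__main__":
    for size_b in (1.1, 3.0, 7.0, 70.0):
        n = replica_capacity(size_b)
        print(f"{size_b:>5.1f}B params -> {n} replica(s) on one accelerator")
```

Under these assumptions a 1.1B-parameter model leaves room for roughly ten replicas on a single 80 GB device, a 7B model for a handful, while a 70B model does not fit at all in fp16, which is the intuition behind replicating SLMs to raise accelerator utilization.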
Revised version of the paper published at EuroMLSys'24; fixes Figures 6 and 7.
FOS: Computer and information sciences, Inference optimization, Language models, Assignació de recursos, Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, Natural language processing (Computer science), Pareto optimisation, High performance computing, Tractament del llenguatge natural (Informàtica), Resource allocation, Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Llenguatge natural, Càlcul intensiu (Informàtica), Computation and Language (cs.CL)
| Indicator | Description | Value |
|---|---|---|
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 7 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
| Views | | 54 |
| Downloads | | 31 |

Views and downloads provided by UsageCounts.