arXiv: 2404.03353
handle: 2117/428295, 2117/409824
Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows the Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
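To make the replication argument concrete, below is a minimal back-of-the-envelope sketch (not taken from the paper) of how many independent model replicas fit on a single accelerator once weights and a per-replica KV-cache budget are accounted for. All figures here, the fp16 weight precision, the 4 GB KV-cache budget, the 80 GB accelerator, and the 0.9 memory-utilization factor, are hypothetical assumptions used only for illustration.

```python
# Back-of-the-envelope estimate of how many SLM replicas fit on one accelerator.
# All numbers below are illustrative assumptions, not measurements from the paper.

def replica_capacity(num_params_b: float,
                     bytes_per_param: int = 2,      # fp16/bf16 weights (assumption)
                     kv_cache_gb: float = 4.0,      # per-replica KV-cache budget (assumption)
                     accelerator_gb: float = 80.0,  # single 80 GB accelerator (assumption)
                     utilization: float = 0.9) -> int:
    """Return how many independent replicas fit in the usable accelerator memory."""
    weights_gb = num_params_b * bytes_per_param      # e.g. 1 B params * 2 bytes ~= 2 GB
    per_replica_gb = weights_gb + kv_cache_gb        # weights plus KV cache per replica
    usable_gb = accelerator_gb * utilization         # leave headroom for activations, etc.
    return int(usable_gb // per_replica_gb)

if __name__ == "__main__":
    for size_b in (1.1, 3.0, 7.0, 70.0):
        n = replica_capacity(size_b)
        print(f"{size_b:>5.1f}B params -> {n} replica(s) on one accelerator")
```

Under these assumptions a 1.1B-parameter model leaves room for roughly ten replicas on a single 80 GB device, a 7B model for a handful, while a 70B model does not fit at all in fp16, which is the intuition behind replicating SLMs to raise accelerator utilization.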
Revised version of the paper published at EuroMLSys'24; fixes Figures 6 and 7.
FOS: Computer and information sciences, Inference optimization, Language models, Assignació de recursos, Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, Natural language processing (Computer science), Pareto optimisation, High performance computing, Tractament del llenguatge natural (Informàtica), Resource allocation, Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Llenguatge natural, Càlcul intensiu (Informàtica), Computation and Language (cs.CL)
| Indicator | Description | Value |
|---|---|---|
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 7 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
| Views | | 54 |
| Downloads | | 31 |

Views and downloads provided by UsageCounts.