ZENODO · Other literature type · 2026 · Data source: Datacite

On-Device Large Language Model Inference at the Network Edge: Architecture, Optimization, and Cross-Platform Runtime Design

Authors: Shirokikh, Emil

Abstract

This technical report, prepared during the Stanford Ignite Program, presents a comprehensive framework and system architecture (SlyOS) for executing large language models (LLMs) directly on consumer hardware at the network edge. The rapid growth of LLMs has produced remarkable gains in natural language understanding, generation, and reasoning, but the computational and financial burden of cloud-hosted inference creates a fundamental bottleneck for latency-sensitive applications, bandwidth-constrained environments, and privacy-conscious deployments. On-device LLM inference, the practice of executing transformer-based language models directly on consumer hardware and eliminating the round trip to centralized GPU clusters, addresses this bottleneck and is the subject of this report. We introduce a cross-platform runtime architecture that unifies inference execution across heterogeneous edge devices spanning iOS, Android, and web browsers through a single model artifact format and a shared abstraction over hardware-specific acceleration primitives (Core ML, NNAPI, WebAssembly SIMD). The system implements aggressive 4-bit post-training quantization with activation-aware calibration, achieving model footprints between 0.26 and 3.7 gigabytes; published quantization studies demonstrate that this level of compression preserves generation quality within 0.1-0.2 perplexity points of full-precision baselines for 7B-class models. We describe a device intelligence layer that constructs hardware capability profiles, covering compute throughput, thermal characteristics, available memory, and GPU architecture, to drive automatic model selection, batch-size calibration, and execution-provider routing.
The report further develops a hybrid retrieval-augmented generation (RAG) pipeline that augments on-device generation with server-side vector retrieval over domain-specific knowledge bases, enabling factual grounding without transmitting raw user queries.
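
The 4-bit post-training quantization described in the abstract can be sketched as group-wise symmetric quantization. The group size, the symmetric [-8, 7] integer range, and the one-float-scale-per-group layout below are illustrative assumptions, not the report's exact scheme (which additionally applies activation-aware calibration):

```python
def quantize_int4_groupwise(weights, group_size=64):
    """Symmetric 4-bit group-wise quantization: each group of `group_size`
    weights shares one float scale; values map to integers in [-8, 7]."""
    assert len(weights) % group_size == 0
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(x) for x in group) / 7.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        quantized.append([max(-8, min(7, round(x / scale))) for x in group])
    return quantized, scales

def dequantize(quantized, scales):
    """Reconstruct float weights; rounding error is at most half a step."""
    out = []
    for group, scale in zip(quantized, scales):
        out.extend(v * scale for v in group)
    return out
```

Stored as 4-bit codes plus one float scale per 64 weights, this layout costs roughly 4.5 bits per weight, which is the order of compression that brings a 7B-parameter model down toward the 3.7 GB upper bound quoted above.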

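The device intelligence layer's automatic model selection and execution-provider routing might look like the following sketch. The model catalog, variant names, and headroom factor are hypothetical; only the footprint range (0.26-3.7 GB) and the acceleration primitives (Core ML, NNAPI, WebAssembly SIMD) come from the abstract:

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    platform: str          # "ios", "android", or "web"
    free_memory_gb: float  # memory available to the runtime

# Hypothetical catalog: (variant name, on-disk footprint in GB),
# spanning the 0.26-3.7 GB range quoted in the abstract.
MODEL_VARIANTS = [
    ("llm-0.3b-q4", 0.26),
    ("llm-1b-q4", 0.9),
    ("llm-3b-q4", 2.1),
    ("llm-7b-q4", 3.7),
]

# Route to the platform's hardware acceleration primitive.
PROVIDERS = {"ios": "coreml", "android": "nnapi", "web": "wasm-simd"}

def select_model(profile: DeviceProfile, headroom: float = 1.5):
    """Pick the largest quantized variant whose footprint, scaled by a
    safety headroom factor, fits in free memory; fall back to smallest."""
    chosen = MODEL_VARIANTS[0][0]
    for name, size_gb in MODEL_VARIANTS:
        if size_gb * headroom <= profile.free_memory_gb:
            chosen = name
    return chosen, PROVIDERS[profile.platform]
```

In this sketch a 6 GB iPhone would route the 3.7 GB variant to Core ML, while a memory-constrained browser tab would fall back to the 0.26 GB variant on WebAssembly SIMD; a real capability profile would also weigh compute throughput and thermal state, as the abstract describes.
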
Keywords

edge computing, ONNX Runtime, retrieval-augmented generation, mobile AI, large language models, SlyOS, model quantization, on-device inference, cross-platform runtime
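
The hybrid RAG pipeline can be sketched as follows. The abstract states that raw user queries are not transmitted; one way to realize that, assumed here, is to embed the query on-device and send only the embedding vector to the server-side retriever. All function names are hypothetical stand-ins:

```python
def hybrid_rag_answer(query, embed_local, retrieve_remote, generate_local,
                      top_k=4):
    """Hybrid RAG: only the query embedding leaves the device; retrieval
    runs server-side over the knowledge base, generation stays local."""
    vector = embed_local(query)                  # on-device embedding
    passages = retrieve_remote(vector, top_k)    # server sees the vector,
                                                 # never the query text
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate_local(prompt)                # on-device generation
```

The design choice here is the trust boundary: the server contributes factual grounding from domain-specific knowledge bases, while the query text and the generated answer never leave the device.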

BIP! indicators: selected citations 0 · popularity Average · influence Average · impulse Average. Selected citations are derived from selected sources; popularity reflects an article's current attention, influence its overall diachronic impact, and impulse its initial momentum after publication, all based on the underlying citation network.