ZENODO · Other literature type · 2026 · Data source: Datacite

On-Device Large Language Model Inference at the Network Edge: Architecture, Optimization, and Cross-Platform Runtime Design

Authors: Shirokikh, Emil

Abstract

This technical report, prepared during the Stanford Ignite Program, presents a comprehensive framework and system architecture (SlyOS) for executing large language models (LLMs) directly on consumer hardware at the network edge. The rapid growth of LLMs has produced remarkable gains in natural language understanding, generation, and reasoning, but the computational and financial burden of cloud-hosted inference creates a fundamental bottleneck for latency-sensitive applications, bandwidth-constrained environments, and privacy-conscious deployments. On-device LLM inference, the practice of executing transformer-based language models directly on consumer hardware and eliminating the round trip to centralized GPU clusters, addresses this bottleneck and is the subject of this report. We introduce a cross-platform runtime architecture that unifies inference execution across heterogeneous edge devices spanning iOS, Android, and web browsers through a single model artifact format and a shared abstraction over hardware-specific acceleration primitives (Core ML, NNAPI, WebAssembly SIMD). The system implements aggressive 4-bit post-training quantization with activation-aware calibration, achieving model footprints between 0.26 and 3.7 gigabytes; published quantization studies demonstrate that this level of compression preserves generation quality within 0.1-0.2 perplexity points of full-precision baselines for 7B-class models. We describe a device intelligence layer that constructs hardware capability profiles, covering compute throughput, thermal characteristics, available memory, and GPU architecture, to drive automatic model selection, batch-size calibration, and execution-provider routing.
The report further develops a hybrid retrieval-augmented generation (RAG) pipeline that augments on-device generation with server-side vector retrieval over domain-specific knowledge bases, enabling factual grounding without transmitting raw user queries.
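
The 4-bit post-training quantization described in the abstract can be sketched as group-wise symmetric quantization. The group size, the symmetric [-8, 7] integer range, and the one-float-scale-per-group layout below are illustrative assumptions, not the report's exact scheme (which additionally applies activation-aware calibration):

```python
def quantize_int4_groupwise(weights, group_size=64):
    """Symmetric 4-bit group-wise quantization: each group of `group_size`
    weights shares one float scale; values map to integers in [-8, 7]."""
    assert len(weights) % group_size == 0
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(x) for x in group) / 7.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        quantized.append([max(-8, min(7, round(x / scale))) for x in group])
    return quantized, scales

def dequantize(quantized, scales):
    """Reconstruct float weights; rounding error is at most half a step."""
    out = []
    for group, scale in zip(quantized, scales):
        out.extend(v * scale for v in group)
    return out
```

Stored as 4-bit codes plus one float scale per 64 weights, this layout costs roughly 4.5 bits per weight, which is the order of compression that brings a 7B-parameter model down toward the 3.7 GB upper bound quoted above.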

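The device intelligence layer's automatic model selection and execution-provider routing might look like the following sketch. The model catalog, variant names, and headroom factor are hypothetical; only the footprint range (0.26-3.7 GB) and the acceleration primitives (Core ML, NNAPI, WebAssembly SIMD) come from the abstract:

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    platform: str          # "ios", "android", or "web"
    free_memory_gb: float  # memory available to the runtime

# Hypothetical catalog: (variant name, on-disk footprint in GB),
# spanning the 0.26-3.7 GB range quoted in the abstract.
MODEL_VARIANTS = [
    ("llm-0.3b-q4", 0.26),
    ("llm-1b-q4", 0.9),
    ("llm-3b-q4", 2.1),
    ("llm-7b-q4", 3.7),
]

# Route to the platform's hardware acceleration primitive.
PROVIDERS = {"ios": "coreml", "android": "nnapi", "web": "wasm-simd"}

def select_model(profile: DeviceProfile, headroom: float = 1.5):
    """Pick the largest quantized variant whose footprint, scaled by a
    safety headroom factor, fits in free memory; fall back to smallest."""
    chosen = MODEL_VARIANTS[0][0]
    for name, size_gb in MODEL_VARIANTS:
        if size_gb * headroom <= profile.free_memory_gb:
            chosen = name
    return chosen, PROVIDERS[profile.platform]
```

In this sketch a 6 GB iPhone would route the 3.7 GB variant to Core ML, while a memory-constrained browser tab would fall back to the 0.26 GB variant on WebAssembly SIMD; a real capability profile would also weigh compute throughput and thermal state, as the abstract describes.
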
Keywords

edge computing, ONNX Runtime, retrieval-augmented generation, mobile AI, large language models, SlyOS, model quantization, on-device inference, cross-platform runtime
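
The hybrid RAG pipeline can be sketched as follows. The abstract states that raw user queries are not transmitted; one way to realize that, assumed here, is to embed the query on-device and send only the embedding vector to the server-side retriever. All function names are hypothetical stand-ins:

```python
def hybrid_rag_answer(query, embed_local, retrieve_remote, generate_local,
                      top_k=4):
    """Hybrid RAG: only the query embedding leaves the device; retrieval
    runs server-side over the knowledge base, generation stays local."""
    vector = embed_local(query)                  # on-device embedding
    passages = retrieve_remote(vector, top_k)    # server sees the vector,
                                                 # never the query text
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate_local(prompt)                # on-device generation
```

The design choice here is the trust boundary: the server contributes factual grounding from domain-specific knowledge bases, while the query text and the generated answer never leave the device.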

BIP! indicators: selected citations 0 · popularity Average · influence Average · impulse Average. Selected citations are derived from selected sources; popularity reflects an article's current attention, influence its overall diachronic impact, and impulse its initial momentum after publication, all based on the underlying citation network.