
This technical report, prepared during the Stanford Ignite Program, presents SlyOS, a framework and system architecture for executing large language models (LLMs) directly on consumer hardware at the network edge. The rapid growth of LLMs has produced remarkable gains in natural language understanding, generation, and reasoning, but the computational and financial burden of cloud-hosted inference creates a fundamental bottleneck for latency-sensitive applications, bandwidth-constrained environments, and privacy-conscious deployments. On-device inference, the practice of executing transformer-based language models directly at the edge, eliminates the round trip to centralized GPU clusters. We introduce a cross-platform runtime architecture that unifies inference execution across heterogeneous edge devices spanning iOS, Android, and web browsers through a single model artifact format and a shared abstraction over hardware-specific acceleration primitives (Core ML, NNAPI, WebAssembly SIMD). The system implements aggressive 4-bit post-training quantization with activation-aware calibration, achieving model footprints between 0.26 and 3.7 gigabytes; published quantization studies demonstrate that this level of compression preserves generation quality within 0.1-0.2 perplexity points of full-precision baselines for 7B-class models. We describe a device intelligence layer that constructs hardware capability profiles (compute throughput, thermal characteristics, available memory, and GPU architecture) to drive automatic model selection, batch size calibration, and execution provider routing.
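The capability-driven model selection described above can be illustrated with a minimal sketch. The profile fields, catalog entries, model names, and headroom heuristic here are all hypothetical, chosen only so the footprints span the 0.26-3.7 GB range stated in the abstract; the actual SlyOS selection logic is not specified in this report excerpt.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    """Illustrative hardware capability profile (fields assumed)."""
    ram_gb: float   # memory available to the runtime
    tflops: float   # sustained compute throughput
    has_gpu: bool   # hardware acceleration available

# Hypothetical catalog of 4-bit quantized artifacts with footprints
# in the 0.26-3.7 GB range reported in the abstract.
MODEL_CATALOG = [
    ("slyos-0.5b-q4", 0.26),
    ("slyos-3b-q4", 1.8),
    ("slyos-7b-q4", 3.7),
]

def select_model(profile: DeviceProfile, headroom: float = 0.5):
    """Pick the largest quantized model whose weights fit in memory,
    leaving a `headroom` fraction free for the KV cache and the OS."""
    budget = profile.ram_gb * (1.0 - headroom)
    fitting = [(name, gb) for name, gb in MODEL_CATALOG if gb <= budget]
    if not fitting:
        return None  # no on-device model fits; fall back to cloud
    return max(fitting, key=lambda m: m[1])[0]
```

For example, a device with 8 GB of RAM and 50% headroom yields a 4 GB weight budget, selecting the 3.7 GB artifact, while a 2 GB device falls back to the smallest model.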
The report further develops a hybrid retrieval-augmented generation (RAG) pipeline that augments on-device generation with server-side vector retrieval over domain-specific knowledge bases, enabling factual grounding without transmitting raw user queries.
Keywords: edge computing, ONNX Runtime, retrieval-augmented generation, mobile AI, large language models, SlyOS, model quantization, on-device inference, cross-platform runtime
