Name: MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines
Keywords: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); FOS: Computer and information sciences

Publication: Article, Preprint, 01 Jan 2025
Embargo end date: 01 Jan 2024
Publisher: Association for Computational Linguistics (ACL)
Journal: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Funded by: NSF | SHF: Small: ML Accelerato...

Authors: Gao, Lei; Ziashahabi, Amir; Niu, Yue; Avestimehr, Salman; Annavaram, Murali

doi: 10.18653/v1/2025.emnlp-main.1022, 10.48550/arxiv.2409.15520

arXiv: 2409.15520

MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines

Abstract

Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. While promising, direct application of ZO methods on edge devices is inefficient due to the high computational cost of multiple forward passes required for accurate gradient estimation, and their deployment has been largely unexplored in practice. We introduce MobiZO, a resource-efficient fine-tuning framework for LLMs specifically designed for edge devices. MobiZO combines three key innovations: (1) a parallelized randomized gradient estimator that employs both outer-loop and inner-loop parallelism to eliminate sequential forward passes, (2) a specialized Multi-Perturbed LoRA (MP-LoRA) module that enables efficient realization of both inner-loop and outer-loop parallelism, and (3) a seamless integration with ExecuTorch for on-device training, requiring no modifications to the runtime. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications.
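
As a rough illustration of the zeroth-order idea mentioned in the abstract (approximating gradients from forward passes only), the following is a minimal NumPy sketch of a generic two-point randomized gradient estimator. It is not the MobiZO implementation and does not use ExecuTorch or LoRA: the toy linear model, the function names (`loss_batched`, `zo_gradient`, `zo_finetune`), and all hyperparameters are assumptions chosen for illustration. The perturbed weight copies are stacked and evaluated in one batched call, which only hints at how several forward passes might be run together rather than sequentially.

```python
import numpy as np

def loss_batched(W, X, y):
    """Mean squared error of a toy linear model for a stack of weight vectors.

    W: (p, d) stack of p candidate weight vectors
    X: (n, d) inputs, y: (n,) targets
    Returns a (p,) array with one loss per candidate.
    """
    preds = W @ X.T  # (p, n): one forward pass per row of W
    return np.mean((preds - y) ** 2, axis=1)

def zo_gradient(w, X, y, q=8, eps=1e-3, rng=None):
    """Two-point randomized gradient estimate averaged over q directions."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.standard_normal((q, w.size))  # q random perturbation directions
    # Stack the +eps and -eps copies so all 2q forward passes run in one call.
    W = np.concatenate([w + eps * U, w - eps * U], axis=0)  # (2q, d)
    losses = loss_batched(W, X, y)
    # Central finite difference along each direction, shape (q,).
    slopes = (losses[:q] - losses[q:]) / (2.0 * eps)
    # Average the directional slopes back into a gradient estimate, shape (d,).
    return slopes @ U / q

def zo_finetune(w, X, y, steps=500, lr=0.05, q=8, seed=0):
    """Plain gradient descent driven only by forward-pass loss values."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        w = w - lr * zo_gradient(w, X, y, q=q, rng=rng)
    return w
```

In this sketch no backward pass or stored activations are needed, which is the property that makes forward-only (inference-style) execution sufficient for the update; the cost is 2q loss evaluations per step, which is why evaluating them together matters.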

Related research (1 research product)

- PRGE software on GitHub (relation: IsRelatedTo)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green

Funded by

NSF | SHF: Small: ML Accelerator Cohort Architecture
