Powered by OpenAIRE graph

Bit-Width Quantization and Prompt Optimization: Achieving 90% Energy Savings in Large Language Models

Authors: Dhakal, Anupam; Pokharel, Prashant; Adhikari, Sabin;


Abstract

In the rapidly evolving field of Large Language Models (LLMs), rapid scaling has posed significant challenges, including exorbitant energy consumption, prohibitively expensive deployment, and a substantial impact on environmental sustainability. A major contributor to these problems is the colossal size of LLMs, which typically comprise billions of parameters, combined with the need to run them in resource-scarce or edge environments. Our research investigates a practical and immediately applicable approach to improving the energy efficiency of LLMs by combining low-bit-width quantization with streamlined prompting techniques. We tested this approach on Llama-based models ranging from hundreds of millions to over one billion parameters, applying 4-bit post-training compression together with structured prompt and query optimization across this spectrum of models. Using a well-controlled A/B testing framework, we evaluated task accuracy, latency, and power consumption for the baseline and optimized configurations. Because we could measure the actual power draw of our hardware, we summarized the performance of both configurations with an accuracy-per-watt metric. Our results show that 4-bit compression alone eliminates a significant portion of memory usage and electricity consumption, and that prompt optimization further reduces the cost of token-level inference. Used in tandem, these two techniques led to a 90% reduction in energy consumption with statistically insignificant losses in accuracy on the tasks we evaluated. We also verified the approach in real-world use, demonstrating consistent efficiency benefits on severely constrained hardware. A scalability analysis showed that the method remains cost-effective for models with over one billion parameters.
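The abstract does not specify the exact quantization scheme or the precise form of the accuracy-per-watt metric. As a minimal illustrative sketch, the following Python snippet shows one common form of 4-bit post-training quantization (symmetric, per-tensor) and a simple accuracy-per-watt comparison between a baseline and an optimized configuration; all function names and the numeric values in the example are assumptions for illustration, not figures from the paper.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Symmetric per-tensor 4-bit post-training quantization.

    Maps float weights onto the 16 integer levels in [-8, 7];
    a single scale factor recovers an approximation of the originals.
    """
    scale = float(np.max(np.abs(weights))) / 7.0  # largest magnitude -> level 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from 4-bit levels."""
    return q.astype(np.float32) * scale

def accuracy_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Efficiency metric: task accuracy divided by average power draw."""
    return accuracy / avg_power_watts

# Example: quantize a random weight matrix and measure reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_4bit(w)
err = float(np.mean(np.abs(w - dequantize(q, s))))

# Hypothetical numbers: a small accuracy drop for a large power reduction
# yields a much higher accuracy-per-watt score.
baseline = accuracy_per_watt(accuracy=0.82, avg_power_watts=50.0)
optimized = accuracy_per_watt(accuracy=0.81, avg_power_watts=5.0)
```

In practice, 4-bit LLM quantization is usually applied per-channel or per-group rather than per-tensor, and packed two values per byte; this sketch only conveys the core idea of mapping weights to a 16-level grid and scoring configurations by accuracy per watt.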

Keywords

Large Language Models, Bit-Width Quantization, Edge Deployment, Prompt Optimization, Energy-Efficient AI
