Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

Publication: Article, Preprint. Date: 01 Jan 2023. Embargo end date: 01 Jan 2023. Publisher: arXiv. Journal: CoRR, volume abs/2301.12017

Authors: Xiaoxia Wu; Cheng Li; Reza Yazdani Aminabadi; Zhewei Yao; Yuxiong He

doi: 10.48550/arxiv.2301.12017

arXiv: 2301.12017

Abstract

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this study, we explore the feasibility of employing INT4 weight and activation (W4A4) quantization for language models. Our findings indicate that W4A4 quantization introduces no to negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using W4A4, we develop a highly optimized end-to-end W4A4 encoder inference pipeline supporting different quantization strategies. Our INT4 pipeline is $8.5\times$ faster for latency-oriented scenarios and up to $3\times$ for throughput-oriented scenarios compared to the inference of FP16, and improves the SOTA BERT INT8 performance from FasterTransformer by up to $1.7\times$. We provide insights into the failure cases when applying W4A4 to decoder-only models, and further explore the compatibility of INT4 quantization with other compression methods, like pruning and layer reduction.
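As a rough illustration of what W4A4 means in practice, the sketch below quantizes both weights and activations to signed 4-bit integers with a single symmetric scale per vector, accumulates the product in integers, and rescales once at the end. It is a minimal pure-Python sketch under stated assumptions, not the paper's optimized pipeline: the helper names `quantize_int4` and `w4a4_dot`, the per-vector symmetric scaling, and the example values are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Signed 4-bit integers cover [-8, 7]; the symmetric scheme below only uses [-7, 7].
INT4_MIN, INT4_MAX = -8, 7


def quantize_int4(values: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one vector to INT4 (hypothetical helper).

    Returns the integer codes and the scale such that value ~= code * scale.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Assumption: one scale per vector; an all-zero vector falls back to scale 1.0.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return codes, scale


def w4a4_dot(weights: Sequence[float], activations: Sequence[float]) -> float:
    """Dot product with both operands in INT4 (W4A4), rescaled at the end."""
    qw, sw = quantize_int4(weights)
    qa, sa = quantize_int4(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer multiply-accumulate
    return acc * sw * sa


if __name__ == "__main__":
    weights = [0.5, -0.25, 0.125]
    activations = [1.0, 2.0, -1.0]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("exact:", exact, "W4A4 approximation:", w4a4_dot(weights, activations))

    # A single large value stretches the scale so the small values collapse to 0.
    codes, scale = quantize_int4([0.1, 0.2, 0.3, 10.0])
    print("codes:", codes, "scale:", scale)
```

The last call shows a generic weakness of a coarse 4-bit grid: one large-magnitude entry sets the scale, and the remaining entries round to zero. This is offered only as intuition for why 4-bit activations can be fragile; it is not a restatement of the paper's own analysis of the decoder-only failure cases.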

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)

Related research (8)

Source Coding With Distortion Side Information (2008), among top similar documents
LightSeq2: Accelerated Training for Transformer-Based Models on GPUs (2022), among top similar documents
Denoising based Sequence-to-Sequence Pre-training for Text Generation (2019), among top similar documents
Source Coding With Encoder Side Information (2004), among top similar documents
Compare Encoder-Decoder, Encoder-Only, and Decoder-Only Architectures for Text Generation on Low-Resource Datasets (2021), among top similar documents
DeepSpeed software on GitHub, related to this publication
FasterTransformer software on GitHub, related to this publication
dq-bart software on GitHub, related to this publication

Impact (by BIP!)

selected citations: 0
popularity: Average
influence: Average
impulse: Average

Fields of Science: natural sciences