Powered by OpenAIRE graph
https://doi.org/10.4995/thesis...
Doctoral thesis · 2025 · Peer-reviewed
Data sources: Crossref
ZENODO · Thesis · 2025 · License: CC BY
Data sources: Datacite
View all 4 versions

Efficient Mixed-Precision Inference for Vision Transformers

Authors: Kluska, Piotr;


Abstract

Recent advances in deep learning (DL) have been achieved by scaling the number of model parameters and the amount of training data. The Transformer is the most widely used DL architecture in natural language processing. Its key feature is the attention mechanism, which allows the model to focus dynamically on the relevant context in the global latent space. This has led to language models with up to 405 billion parameters, equivalent to 810 gigabytes (GB) of memory in 16-bit floating point (FP16). At this precision, a model of this size requires several accelerator compute nodes. The Transformer architecture has been successfully transferred to the computer vision domain as the Vision Transformer (ViT). Similarly, scaling the number of parameters in the ViT architecture increased the model's predictive power. In addition, modifications to the ViT architecture led to new architectures such as the Data-efficient Image Transformer (DeiT), the Swin Transformer, and DeiT3, each of which addressed shortcomings of the original design. Nevertheless, these models require substantial computing power and energy to process images at scale. The proliferation of Artificial Intelligence (AI) systems and applications will require DL models to run efficiently and with restrained energy consumption, as it is expected that, by 2040, the energy consumed by computing devices will exceed our energy production capability. To tackle this, we investigate compression methods for ViT architectures that reduce the models' memory footprint and shift the computation to more energy-efficient data types. The methods and algorithms presented in this dissertation allow ViT architectures to operate with lower energy consumption and reduced latency at a level of predictive performance similar to the reference model. First, we comprehensively evaluate the effect of post-training quantization on the ViT, DeiT, Swin Transformer, and DeiT3 models.
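As a minimal illustration of the kind of post-training quantization evaluated above, the sketch below applies PyTorch's stock dynamic-quantization API to a toy MLP standing in for the linear layers of a ViT block. This is an assumption-laden stand-in, not the thesis's evaluation code.

```python
import torch
import torch.nn as nn

# Toy MLP with ViT-like hidden sizes; a stand-in for the linear layers
# that dominate a Transformer block (hypothetical, for illustration only).
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Post-training dynamic quantization: weights stored in INT8, activations
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    ref = model(x)        # FP32 reference output
    out = quantized(x)    # INT8 dynamically quantized output

print("max abs error vs FP32:", (ref - out).abs().max().item())
```

Dynamic quantization needs no calibration data, which is why it serves as a convenient baseline for the static and hybrid schemes discussed later in the abstract.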
We show that ViT and DeiT3 lose their predictive power after quantization, while DeiT and Swin do not. We hypothesize that the regularization applied during training improves robustness to quantization. Next, we perform a per-layer analysis using the signal-to-quantization-noise ratio (SQNR), which compares the latent signal passing through a quantized network against the FP32 reference model. We show that a correlation exists between the SQNR value and the quantization error. Moreover, we propose a simple yet effective post-training quantization method that uses mixed-precision computation, allowing us to compress the models by up to 90%. As a result, our models approach fully quantized DL models in size while keeping predictive performance close to FP32. Next, we propose a novel post-training quantization method: Hybrid Quantization (HQ). HQ exploits the fact that ViTs are composed mainly of linear layers. Building on this, we design an automatic algorithm that selects static or dynamic quantization for each linear layer based on the SQNR metric. Our HQ method improves predictive performance compared to static quantization on 12/12 ViT, 3/6 DeiT, 6/6 DeiT3, and 6/6 Swin Transformer models on the ImageNet1K validation set. Furthermore, we evaluate the latency of HQ models in three hardware environments: an Intel Xeon Gold 5218 CPU, a mobile Apple A15 Bionic CPU, and an NVIDIA A100 GPU. We observe average speedups of up to 1.15×, 1.28×, and 1.68×, respectively, compared to dynamic quantization for ViT models. Lastly, we design and implement a mixed-precision attention mechanism in the Triton language. Our mixed-precision attention mixes 8-bit integer (INT8) and FP16 computation to achieve higher throughput and numerical stability than the reference Triton implementation of FlashAttention. We show that a domain-specific language with a compiler can match heavily specialized GPU kernels.
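The per-layer SQNR metric described above can be sketched in a few lines. This is a minimal illustration assuming symmetric max-abs 8-bit quantization of a single activation tensor, not the thesis's exact per-layer setup.

```python
import numpy as np

def sqnr_db(reference: np.ndarray, quantized: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in decibels: power of the
    FP32 reference signal over the power of the quantization error."""
    signal = np.mean(np.square(reference.astype(np.float64)))
    noise = np.mean(np.square(reference.astype(np.float64)
                              - quantized.astype(np.float64)))
    return 10.0 * np.log10(signal / noise)

# Toy activation tensor, quantize-dequantize with symmetric max-abs scaling.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
scale = np.abs(x).max() / 127.0
x_q = (np.clip(np.round(x / scale), -127, 127) * scale).astype(np.float32)

print(f"SQNR: {sqnr_db(x, x_q):.1f} dB")
```

A higher SQNR means the quantized layer preserves more of the reference signal; a layer with a low SQNR is a natural candidate for a higher-precision (or dynamically quantized) fallback.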
Moreover, we open-source our QAttn framework, a library that integrates with the PyTorch post-training quantization ecosystem. We extend PyTorch quantization with custom kernels for quantized matrix multiplication and our mixed-precision attention. We show that our kernels improve the throughput of the ViT model by up to 7.34× compared to the FP32 reference model. Moreover, our framework generalizes to newer foundation models such as the Segment Anything Model (SAM): we process over 5× more images per second for the base and large variants without a drop in mean intersection over union (mIoU) on the COCO2017 validation set. In summary, this thesis holistically addresses the problem of post-training quantization of ViT models. We propose a novel method for quantizing the ViT architecture, and we open-source the QAttn framework, which implements quantized GPU kernels in Triton and integrates with the PyTorch framework. Our experiments demonstrate memory and latency reductions compared to the reference DL model, and our work lays the foundation for further research into the compression of ViT models.
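The mixed-precision idea behind such quantized matmul kernels can be sketched on the CPU: quantize both operands to INT8, accumulate in INT32, and dequantize the result to FP16. This NumPy stand-in assumes per-tensor symmetric scaling and is only an illustration of the numerics, not the actual Triton GPU kernels.

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray, num_bits: int = 8):
    """Symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int8_matmul_fp16(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """INT8 x INT8 matmul with INT32 accumulation, dequantized to FP16."""
    a_q, sa = quantize_per_tensor(a)
    b_q, sb = quantize_per_tensor(b)
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)   # integer accumulate
    return (acc.astype(np.float32) * (sa * sb)).astype(np.float16)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = a @ b                                   # FP32 reference
approx = int8_matmul_fp16(a, b)               # mixed-precision result
rel_err = np.abs(approx.astype(np.float32) - ref).max() / np.abs(ref).max()
print(f"max relative error: {rel_err:.4f}")
```

Accumulating in INT32 before dequantizing is what keeps the result numerically stable; emitting FP16 halves the output bandwidth relative to FP32, which is where much of the throughput gain on GPUs comes from.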

Keywords

Vision Transformers, Deep Learning, Image Classification, Graphics Processing Unit (GPU), Artificial Intelligence, Computer Vision, Quantization, Foundation Models, Instance Segmentation
