Powered by OpenAIRE graph
Conference object . 2025
Data sources: HAL-CEA
https://doi.org/10.5220/001328...
Article . 2025 . Peer-reviewed
Data sources: Crossref

This Research product is the result of merged Research products in OpenAIRE.

Dynamic Hierarchical Token Merging for Vision Transformers

Authors: Haroun, Karim; Allenet, Thibault; Ben Chehida, Karim; Martinet, Jean

Abstract

Vision Transformers (ViTs) have achieved impressive results in computer vision, excelling in tasks such as image classification, segmentation, and object detection. However, their quadratic complexity $O(N^2)$ in the token sequence length $N$ makes them difficult to deploy on resource-limited devices. To address this issue, dynamic token merging has emerged as an effective strategy: it progressively reduces the token count during inference to save computation. Some strategies consider all tokens in the sequence as merging candidates, without focusing on spatially close tokens. Others either limit token merging to a local window or constrain it to pairs of adjacent tokens, and thus fail to capture more complex feature relationships. In this paper, we propose Dynamic Hierarchical Token Merging (DHTM), a novel token merging approach built on the premise that spatially close tokens share more information than distant ones; it considers all pairs of spatially close candidates instead of imposing fixed windows. Our approach also draws on the principles of Hierarchical Agglomerative Clustering (HAC): in each layer, we iteratively merge tokens, fusing a fixed number of selected neighbor token pairs based on their similarity. The approach is off-the-shelf, i.e., it requires no additional training. We evaluate it on the ImageNet-1K dataset for classification, achieving substantial computational savings while minimizing accuracy reduction and surpassing existing token merging methods.
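
To make the idea in the abstract concrete, the following is a minimal illustrative sketch of one neighbor-constrained, HAC-style merging step. It is not the authors' implementation. Several choices are assumptions, since the abstract does not specify them: a 4-connected patch grid as the notion of "spatially close", cosine similarity as the similarity measure, greedy selection of non-overlapping pairs, and size-weighted averaging as the fusion rule. The names `cosine`, `grid_adjacency`, and `merge_step` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def grid_adjacency(h, w):
    """4-connected neighbors on an h x w patch grid, tokens indexed row-major.

    Assumption: "spatially close" is modeled as direct grid adjacency.
    """
    adj = {i: set() for i in range(h * w)}
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:  # right neighbor
                adj[i].add(i + 1)
                adj[i + 1].add(i)
            if r + 1 < h:  # bottom neighbor
                adj[i].add(i + w)
                adj[i + w].add(i)
    return adj


def merge_step(feats, sizes, adj, r):
    """Fuse up to r of the most similar, non-overlapping neighbor pairs.

    feats: id -> feature vector, sizes: id -> number of original patches,
    adj: id -> set of neighbor ids. All three are updated in place.
    Returns the number of pairs actually merged.
    """
    # Score every pair of currently adjacent tokens once (i < j).
    pairs = sorted(
        ((cosine(feats[i], feats[j]), i, j) for i in adj for j in adj[i] if i < j),
        reverse=True,
    )
    used, merged = set(), 0
    for _, i, j in pairs:
        if merged == r:
            break
        if i in used or j in used:
            continue  # each token takes part in at most one merge per step
        used.update((i, j))
        si, sj = sizes[i], sizes[j]
        # Size-weighted average (assumed fusion rule); token i absorbs token j.
        feats[i] = [(si * x + sj * y) / (si + sj) for x, y in zip(feats[i], feats[j])]
        sizes[i] = si + sj
        # The merged token inherits j's neighbors, as in agglomerative clustering.
        for k in adj.pop(j):
            adj[k].discard(j)
            if k != i:
                adj[k].add(i)
                adj[i].add(k)
        del feats[j], sizes[j]
        merged += 1
    return merged


# Toy example: a 2 x 2 grid with 2-dimensional features.
feats = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [1.0, 1.0]}
sizes = {i: 1 for i in feats}
adj = grid_adjacency(2, 2)
merge_step(feats, sizes, adj, r=1)  # tokens 0 and 1 are fused; 3 tokens remain
```

Calling `merge_step` once per layer with a fixed `r` shrinks the sequence progressively, and because merged tokens inherit their neighbors, later steps can grow irregular regions rather than staying inside fixed windows. Since self-attention cost scales as $O(N^2)$, each reduction in $N$ lowers the cost of every subsequent layer.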

Keywords

Vision Transformers, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], [INFO.INFO-CV] Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Neural network compression, Dynamic neural networks, Token merging

  • Impact indicators (provided by BIP!)
    Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator.
    Popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
Open access route: Green