
Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing less important tokens. In vision tasks it works well because non-object regions can be discarded, but applying it to audio tasks presents unique challenges, as distinguishing relevant from irrelevant regions in time-frequency representations is less straightforward. In this study, for the first time, we applied token pruning to ViT-based audio classification m

Research goal: What is the impact of dynamic token count on FLOPs efficiency and reasoning accuracy when processing variable-complexity images with different tokenization strategies?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.5/10.
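To make the relationship between token count and compute concrete, the following is a minimal sketch rather than the method used in the study. It assumes each token already has an importance score (how scores are obtained is not specified above, so `scores` is a hypothetical input), keeps the top-scoring fraction of tokens, and estimates the cost of one standard Transformer encoder layer in multiply-accumulates. The function names `prune_tokens` and `encoder_layer_macs` are illustrative, and the estimate ignores softmax, normalization, and bias terms.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order.

    `scores` is a hypothetical per-token importance value (assumption:
    higher means more important); the scoring method is not given here.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def encoder_layer_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one Transformer encoder layer.

    Assumes a standard layer: self-attention followed by a two-layer MLP
    with hidden width `mlp_ratio * dim`.
    """
    projections = 4 * num_tokens * dim * dim       # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim  # QK^T and weighted sum over V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim   # two linear layers of the MLP
    return projections + attention + mlp


if __name__ == "__main__":
    patches = ["p0", "p1", "p2", "p3"]
    importance = [0.1, 0.9, 0.5, 0.3]
    print(prune_tokens(patches, importance, keep_ratio=0.5))
    full = encoder_layer_macs(num_tokens=196, dim=768)
    pruned = encoder_layer_macs(num_tokens=98, dim=768)
    print(f"pruned / full cost: {pruned / full:.3f}")
```

Because the attention term grows quadratically with the number of tokens while the projection and MLP terms grow linearly, halving the token count under these assumptions cuts the per-layer cost by slightly more than half.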
