Vision Transformers Are Robust Learners

descriptionPublicationkeyboard_double_arrow_right Article , Preprint , Conference object 28 Jun 2022Embargo end date: 01 Jan 2021Publisher:Association for the Advancement of Artificial Intelligence (AAAI)Journal:Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2,071-2,081 (issn: 2159-5399, eissn: 2374-3468,

Copyright policy )

Authors: Sayak Paul; Pin-Yu Chen;

doi: 10.1609/aaai.v36i2.20103 , 10.48550/arxiv.2105.07581

arXiv: 2105.07581

Vision Transformers Are Robust Learners

- Summary
- Subjects
- Metrics

Abstract

Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art (SOTA) standard accuracy. What remains largely unexplored is their robustness evaluation and attribution. In this work, we study the robustness of the Vision Transformer (ViT) (Dosovitskiy et al. 2021) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six different diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT(Dosovitskiy et al. 2021) models and SOTA convolutional neural networks (CNNs), Big-Transfer (Kolesnikov et al. 2020). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative andqualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1accuracy of 28.10% on ImageNet-A which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on discrete cosine energy spectrum reveal intriguing properties of ViT attributing to improved robustness. Code for reproducing our experiments is available at https://git.io/J3VO0.

Related Organizations

IBM Research
IBM Research
IBM Research
IBM Research - China
China (People's Republic of)
IBM Research (International)

Keywords

FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	146
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 1%
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Top 1%
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Top 0.1%

Found an issue? Give us feedback

146

Top 1%

Top 0.1%

Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering