M33D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding

Keywords: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Jamal, Muhammad Abdullah; Mohareri, Omid


arXiv.org e-Print Archive
Preprint, 2023
Data sources: arXiv.org e-Print Archive

https://doi.org/10.1109/wacv57...
Article, 2024, Peer-reviewed
License: STM Policy #29
Data sources: Crossref

https://dx.doi.org/10.48550/ar...
Article, 2023
License: CC BY NC SA
Data sources: Datacite


Publication: Article, Preprint, 03 Jan 2024
Embargo end date: 01 Jan 2023
Publisher: IEEE
Journal: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)


doi: 10.1109/wacv57701.2024.00253, 10.48550/arxiv.2309.15313

arXiv: 2309.15313


Abstract

We present M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), a new pre-training strategy built on multi-modal masked autoencoders that leverages 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks, Masked Image Modeling (MIM) and contrastive learning, to embed masked 3D priors and modality-complementary features and to enhance the correspondence between modalities. In contrast to recent approaches that either focus on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is broadly applicable: it improves representation learning, which transfers into better performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation, and depth estimation. Experiments show that M$^{3}$3D outperforms existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101, and OR-AR, notably improving on Mask3D by +1.3% mIoU on ScanNet semantic segmentation. We further evaluate our method in the low-data regime and demonstrate superior data efficiency compared to current state-of-the-art approaches.
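
The abstract does not spell out the training objective. As a rough illustration only of how a masked-reconstruction term and a cross-modal contrastive term can be combined into one loss, the sketch below assumes a mean-squared error over masked patches, an InfoNCE loss that pairs the i-th RGB embedding with the i-th depth embedding, and a hypothetical weight `lam`. All function names, the temperature, and the weighting are assumptions made for illustration, not details taken from the paper.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error computed over masked patches only.

    pred, target: lists of patches (each a list of floats).
    mask: list of 0/1 flags, 1 meaning the patch was masked.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(p)
    return total / count if count else 0.0


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(rgb_emb, depth_emb, temperature=0.1):
    """InfoNCE loss: the i-th RGB embedding should match the i-th depth one."""
    rgb = [_normalize(v) for v in rgb_emb]
    dep = [_normalize(v) for v in depth_emb]
    loss = 0.0
    for i, r in enumerate(rgb):
        logits = [sum(a * b for a, b in zip(r, d)) / temperature for d in dep]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(rgb)


def combined_loss(pred, target, mask, rgb_emb, depth_emb, lam=1.0):
    """Hypothetical combined objective: reconstruction + lam * contrastive."""
    return masked_mse(pred, target, mask) + lam * info_nce(rgb_emb, depth_emb)
```

In this sketch the reconstruction term only sees masked patches, while the contrastive term is smallest when each RGB embedding is closest to the depth embedding of the same sample. How the two terms are actually weighted and which targets are reconstructed are choices the abstract leaves unspecified.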


Impact by BIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average
