MultiMAE Meets Earth Observation: Pre-Training Multi-Modal Multi-Task Masked Autoencoders for Earth Observation Tasks

Name: MultiMAE Meets Earth Observation: Pre-Training Multi-Modal Multi-Task Masked Autoencoders for Earth Observation Tasks
Keywords: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Authors: Jose Sosa; Danila Rukhovich; Anis Kacem; Djamila Aouada


arXiv.org e-Print Archive

Preprint . 2025

Data sources: arXiv.org e-Print Archive

https://doi.org/10.1109/icip55...

Article . 2025 . Peer-reviewed

License: STM Policy #29

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2025

License: CC BY

Data sources: Datacite

DBLP

Article

Data sources: DBLP


Publication: Article, Preprint (14 Sep 2025)
Embargo end date: 01 Jan 2025
Publisher: IEEE
Journal: 2025 IEEE International Conference on Image Processing (ICIP)


doi: 10.1109/icip55913.2025.11084679 , 10.48550/arxiv.2505.14951

arXiv: 2505.14951


Abstract

Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.
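The masking step behind this kind of pre-training can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the token counts, the number of visible tokens, and the uniform sampling over the pooled tokens are all assumptions made for illustration. The idea shown is that only a small subset of patch tokens, drawn across all available modalities, is kept visible, and every remaining token becomes a reconstruction target.

```python
import random


def sample_visible_tokens(tokens_per_modality, num_visible, seed=None):
    """Pick `num_visible` tokens uniformly from the pooled tokens of all modalities.

    Uniform pooling is an assumption of this sketch, not a detail from the paper.
    Returns {modality: sorted list of visible token indices}.
    """
    rng = random.Random(seed)
    # Pool every (modality, token index) pair so modalities compete for visibility.
    pool = [(name, i) for name, count in tokens_per_modality.items() for i in range(count)]
    if not 0 <= num_visible <= len(pool):
        raise ValueError("num_visible must be between 0 and the total token count")
    visible = {name: [] for name in tokens_per_modality}
    for name, index in rng.sample(pool, num_visible):
        visible[name].append(index)
    return {name: sorted(indices) for name, indices in visible.items()}


def masked_targets(tokens_per_modality, visible):
    """Return the tokens that were hidden, i.e. the reconstruction targets."""
    targets = {}
    for name, count in tokens_per_modality.items():
        seen = set(visible[name])
        targets[name] = [i for i in range(count) if i not in seen]
    return targets


# Hypothetical setup: three modalities with 196 patch tokens each, 98 kept visible.
tokens = {"spectral": 196, "elevation": 196, "segmentation": 196}
visible = sample_visible_tokens(tokens, num_visible=98, seed=0)
targets = masked_targets(tokens, visible)
print({name: len(indices) for name, indices in visible.items()})
```

The same function accepts any subset of modalities, for example `sample_visible_tokens({"spectral": 196}, 49)`, which mirrors the flexibility to handle different input configurations described above.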

Related Organizations

University of Luxembourg
Luxembourg


Impact by BIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average
