
Wide-range, multiscale object detection for multispectral scene perception from a drone perspective is challenging. Previous RGB-T perception methods directly use backbones pretrained on RGB images to extract thermal-infrared features, which introduces an unexpected domain shift. We propose a novel multimodal feature-guided masked reconstruction pretraining method, named M2FP, that learns transferable representations for drone-based RGB-T environmental perception tasks without domain bias. This article makes two key contributions. 1) We design a cross-modal feature interaction module in M2FP that encourages the modality-specific backbones to actively learn cross-modal feature representations and avoids modality bias. 2) We design a global-aware feature interaction and fusion module suitable for various downstream tasks, which enhances the model's environmental perception from a global perspective in wide-range drone-based scenes. We fine-tune M2FP on a drone-based object detection dataset (DroneVehicle) and a semantic segmentation dataset (Kust4K). On these two tasks, M2FP achieves state-of-the-art performance, surpassing the second-best methods by 1.8% in mean average precision and 0.9% in mean intersection over union, respectively.
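The abstract does not specify M2FP's architecture in detail, so the following is only a minimal sketch of the general idea it describes: two modality-specific encoders (RGB and thermal infrared), a cross-attention step standing in for the cross-modal feature interaction module, and a masked-reconstruction objective. All names (PatchEmbed, CrossModalInteraction, MaskedReconPretrainer), the mask-token masking scheme, the mask ratio, depths, and the loss are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of masked reconstruction pretraining with cross-modal
# feature interaction for RGB-T inputs. Not the authors' M2FP code.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to tokens."""
    def __init__(self, patch_size=16, in_ch=3, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                   # (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim)


class CrossModalInteraction(nn.Module):
    """Assumed form of cross-modal interaction: one branch attends to the other."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_tokens, kv_tokens):
        out, _ = self.attn(q_tokens, kv_tokens, kv_tokens)
        return self.norm(q_tokens + out)


class MaskedReconPretrainer(nn.Module):
    """Mask patches in both modalities, encode, exchange features, reconstruct."""
    def __init__(self, dim=256, patch_size=16, mask_ratio=0.75):
        super().__init__()
        self.patch_size, self.mask_ratio = patch_size, mask_ratio
        self.embed_rgb = PatchEmbed(patch_size, in_ch=3, dim=dim)
        self.embed_tir = PatchEmbed(patch_size, in_ch=1, dim=dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc_rgb = nn.TransformerEncoder(layer, num_layers=2)
        self.enc_tir = nn.TransformerEncoder(layer, num_layers=2)
        self.interact_rgb = CrossModalInteraction(dim)
        self.interact_tir = CrossModalInteraction(dim)
        self.head_rgb = nn.Linear(dim, patch_size * patch_size * 3)
        self.head_tir = nn.Linear(dim, patch_size * patch_size * 1)

    def _patchify(self, img, ch):
        # Flatten raw pixels into per-patch reconstruction targets.
        p = self.patch_size
        b, _, h, w = img.shape
        x = img.reshape(b, ch, h // p, p, w // p, p)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * ch)

    @staticmethod
    def _masked_mse(rec, tgt, mask):
        per_patch = ((rec - tgt) ** 2).mean(dim=-1)          # (B, N)
        return (per_patch * mask).sum() / mask.sum().clamp(min=1)

    def forward(self, rgb, tir):
        tok_rgb, tok_tir = self.embed_rgb(rgb), self.embed_tir(tir)
        b, n, _ = tok_rgb.shape
        # Mask the same random patch positions in both modalities (an assumption).
        mask = (torch.rand(b, n, device=rgb.device) < self.mask_ratio).float()
        tok_rgb = torch.where(mask[..., None].bool(), self.mask_token, tok_rgb)
        tok_tir = torch.where(mask[..., None].bool(), self.mask_token, tok_tir)
        f_rgb, f_tir = self.enc_rgb(tok_rgb), self.enc_tir(tok_tir)
        # Cross-modal feature interaction: each branch queries the other.
        g_rgb = self.interact_rgb(f_rgb, f_tir)
        g_tir = self.interact_tir(f_tir, f_rgb)
        # Reconstruct raw patches; the loss is computed on masked positions only.
        loss = self._masked_mse(self.head_rgb(g_rgb), self._patchify(rgb, 3), mask) \
             + self._masked_mse(self.head_tir(g_tir), self._patchify(tir, 1), mask)
        return loss


if __name__ == "__main__":
    # Toy usage: one pretraining step on random RGB-T tensors.
    model = MaskedReconPretrainer()
    loss = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
    loss.backward()
    print(float(loss))
```

After such a pretraining stage, the two encoders and the interaction module would be kept and fine-tuned with task-specific heads (detection on DroneVehicle, segmentation on Kust4K); the reconstruction heads are discarded.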
Masked autoencoder, multimodal, object detection, semantic segmentation, unmanned aerial vehicle (UAV) remote sensing
