
Real-time object understanding is a critical requirement in intelligent computing applications such as autonomous navigation, industrial automation, smart surveillance, and human–machine interaction. Traditional unimodal learning systems rely heavily on visual data alone, limiting their performance under adverse conditions such as occlusion, low lighting, and noisy environments. To address these challenges, this paper proposes a Transformer-Based Multimodal Fusion Model (TMFM) that integrates heterogeneous data sources—including RGB images, depth maps, audio cues, and sensor metadata—into a unified semantic understanding framework. The model employs modality-specific encoders followed by cross-attention–driven fusion layers, enabling effective alignment and interaction among features from different modalities. A shared transformer decoder performs high-level reasoning to generate accurate object representations. Experimental evaluation on benchmark multimodal datasets demonstrates that TMFM improves object recognition accuracy by up to 18% compared to existing CNN- and RNN-based fusion architectures while maintaining real-time inference capability due to its parallel processing design. The proposed model shows strong potential for deployment in next-generation intelligent systems requiring fast, robust, and context-aware object understanding.
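To make the fusion mechanism concrete, here is a minimal NumPy sketch of the cross-attention step the abstract describes: tokens from one modality (RGB) attend over tokens pooled from the other modalities. All names, token counts, and the embedding width are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, w_q, w_k, w_v):
    """One cross-attention fusion step: `query` tokens from one modality
    attend over `context` tokens gathered from the other modalities."""
    q, k, v = query @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (Tq, Tc) alignment scores
    weights = softmax(scores, axis=-1)          # each query row sums to 1
    return weights @ v                          # (Tq, d) fused representation

d = 32                                          # shared embedding width (assumed)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))

rgb   = rng.normal(size=(6, d))                 # 6 RGB patch tokens (placeholder encoder output)
depth = rng.normal(size=(4, d))                 # 4 depth-map tokens
audio = rng.normal(size=(3, d))                 # 3 audio-frame tokens

context = np.concatenate([depth, audio])        # (7, d) cross-modal context
fused = rgb + cross_attention(rgb, context, w_q, w_k, w_v)  # residual fusion
print(fused.shape)                              # -> (6, 32)
```

In the full TMFM design this step would run inside each fusion layer, with modality-specific encoders producing the tokens and a shared transformer decoder consuming the fused representation; the sketch only isolates the attention arithmetic.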
Keywords: Multimodal fusion, transformer model, real-time object understanding, cross-attention, intelligent systems, deep learning, sensor integration.
