ZENODO · Article · 2025 · License: CC BY · Data source: Datacite
Transformer-Based Multimodal Fusion Model for Real-Time Object Understanding

Authors: Emily Carter¹, Daniel Morgan², Sophia Hayes³


Abstract

Real-time object understanding is a critical requirement in intelligent computing applications such as autonomous navigation, industrial automation, smart surveillance, and human–machine interaction. Traditional unimodal learning systems rely heavily on visual data alone, limiting their performance under adverse conditions such as occlusion, low lighting, and noisy environments. To address these challenges, this paper proposes a Transformer-Based Multimodal Fusion Model (TMFM) that integrates heterogeneous data sources—including RGB images, depth maps, audio cues, and sensor metadata—into a unified semantic understanding framework. The model employs modality-specific encoders followed by cross-attention–driven fusion layers, enabling effective alignment and interaction among features from different modalities. A shared transformer decoder performs high-level reasoning to generate accurate object representations. Experimental evaluation on benchmark multimodal datasets demonstrates that TMFM improves object recognition accuracy by up to 18% compared to existing CNN- and RNN-based fusion architectures while maintaining real-time inference capability due to its parallel processing design. The proposed model shows strong potential for deployment in next-generation intelligent systems requiring fast, robust, and context-aware object understanding.
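The abstract outlines the TMFM pipeline (modality-specific encoders, cross-attention fusion, shared transformer reasoning). The following is a minimal, illustrative PyTorch sketch of that idea only; all module names, dimensions, token shapes, and the use of learned fusion queries are assumptions made for illustration and are not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): modality-specific encoders,
# cross-attention fusion over all modality tokens, and a shared transformer
# stack for high-level reasoning. Dimensions and shapes are assumed.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Projects one modality's token sequence into a shared embedding space."""

    def __init__(self, in_dim: int, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, in_dim)
        return self.encoder(self.proj(x))                # (B, T, d_model)


class CrossAttentionFusion(nn.Module):
    """Learned fusion queries attend over the concatenated modality tokens,
    aligning features across modalities."""

    def __init__(self, d_model: int = 256, n_queries: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, T_total, d_model)
        q = self.queries.expand(tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, tokens, tokens)
        return self.norm(fused + q)                           # (B, n_queries, d_model)


class TMFMSketch(nn.Module):
    """Encode each modality, fuse via cross-attention, then reason over the
    fused tokens with a shared transformer stack (standing in for the
    paper's shared decoder) and predict object classes."""

    def __init__(self, dims: dict[str, int], n_classes: int = 10, d_model: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ModalityEncoder(dim, d_model) for name, dim in dims.items()}
        )
        self.fusion = CrossAttentionFusion(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Encode each modality and concatenate along the token dimension.
        tokens = torch.cat([self.encoders[k](v) for k, v in inputs.items()], dim=1)
        fused = self.fusion(tokens)
        out = self.reasoner(fused)
        return self.head(out.mean(dim=1))  # class logits


if __name__ == "__main__":
    # Toy token sequences per modality: (batch, tokens, feature_dim), all assumed.
    model = TMFMSketch({"rgb": 512, "depth": 128, "audio": 64, "sensor": 16})
    batch = {
        "rgb": torch.randn(2, 49, 512),
        "depth": torch.randn(2, 49, 128),
        "audio": torch.randn(2, 20, 64),
        "sensor": torch.randn(2, 4, 16),
    }
    print(model(batch).shape)  # torch.Size([2, 10])
```

Because each modality is encoded independently before fusion, the encoders can run in parallel, which is consistent with the abstract's claim that the parallel design supports real-time inference.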

Keywords

Multimodal fusion, transformer model, real-time object understanding, cross-attention, intelligent systems, deep learning, sensor integration.
