Name: Human-Inspired Computing for Robust and Efficient Audio-Visual Speech Recognition
Keywords: FOS: Computer and information sciences, Sound (cs.SD), Sound, Multimedia, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Audio and Speech Processing, Multimedia (cs.MM)

Type: Article, Preprint
Date: 01 Sep 2025
Embargo end date: 01 Jan 2024
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE Transactions on Computers, volume 74, pages 2950-2961 (ISSN: 0018-9340, eISSN: 2326-3814)

Authors: Qianhui Liu; Jiadong Wang; Yang Wang; Xin Yang; Gang Pan; Haizhou Li;

doi: 10.1109/tc.2025.3582069, 10.48550/arxiv.2408.16564

arXiv: http://arxiv.org/abs/2408.16564



Abstract

Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal processing and spike activity. For cueing interaction, we propose a visual-cued auditory attention module (VCA2M) that leverages visual cues to guide attention to auditory features. We achieve causal processing by aligning the SNN's temporal dimension with that of visual and auditory features and applying temporal masking to utilize only past and current information. To implement spike activity, in addition to using SNNs, we leverage the event camera to capture lip movement as spikes, mimicking the human retina and providing efficient visual data. We evaluate HI-AVSNN on an audiovisual speech recognition dataset combining the DVS-Lip dataset with its corresponding audio samples. Experimental results demonstrate the superiority of our proposed fusion method, outperforming existing audio-visual SNN fusion methods and achieving a 2.27% improvement in accuracy over the only existing SNN-based AVSR method.
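The abstract combines two ideas that are easier to see in code: visual cues steer attention over auditory features, and a temporal mask restricts each time step to past and current information. The sketch below is only an illustration under stated assumptions, not the paper's VCA2M: it uses ordinary softmax attention on real-valued vectors rather than spiking neurons, and the function name `causal_attention` and the toy feature values are hypothetical.

```python
import math


def causal_attention(queries, keys, values):
    """Hypothetical causal cross-attention over time-aligned sequences.

    queries: visual features, one vector per time step (assumed role).
    keys, values: auditory features, one vector per time step.
    Returns one attended auditory vector per time step.
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Temporal mask: step t may only look at steps 0..t.
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy, made-up features: 3 time steps, 2 dimensions each.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
attended = causal_attention(visual, audio, audio)
```

Because the mask excludes future steps, changing `audio[2]` leaves `attended[0]` and `attended[1]` unchanged, which is the property that lets a causal model emit outputs without waiting for later input.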

Accepted by IEEE TC

Related Organizations

National University of Singapore
Singapore
Zhejiang Ocean University
China (People's Republic of)
Chinese University of Hong Kong
China (People's Republic of)
Shandong Women’s University
China (People's Republic of)
Dalian Polytechnic University
China (People's Republic of)


Impact by BIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average
