
Visual Speech Recognition (VSR) is a rapidly evolving field with diverse applications in human-computer interaction, accessibility, and security. This paper presents an approach to VSR centered on the extraction and analysis of lip movements for speech recognition. Traditional speech recognition systems rely primarily on acoustic information, which makes them vulnerable to noisy environments and audio disturbances. In contrast, the proposed method leverages the visual modality by harnessing the rich information encoded in lip movements during speech production. The study begins by collecting a comprehensive dataset of visual and audio recordings of speech across multiple languages and contexts. A deep learning architecture is then designed to process the visual data, with emphasis on lip movements, alongside the corresponding audio data. The proposed model integrates convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract and fuse information from both modalities; this fusion mitigates the limitations of traditional audio-only speech recognition and improves robustness. We evaluate the visual speech recognition system on a range of benchmark datasets and real-world scenarios. The results demonstrate the efficacy of the approach, highlighting its capacity to improve recognition accuracy, particularly in noisy environments or in situations where audio data is incomplete or unavailable. In conclusion, this research contributes to the advancement of Visual Speech Recognition by introducing an approach that emphasizes lip-movement analysis. By leveraging both audio and visual modalities, the proposed system provides a more robust and versatile solution for speech recognition, with the potential to enhance applications in human-computer interaction, accessibility, and security.
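The abstract describes a pipeline in which a CNN extracts per-frame features from the visual (lip) stream, RNNs encode both the visual and audio sequences, and the two representations are fused before classification. The sketch below illustrates that general CNN–RNN fusion pattern in miniature using NumPy. It is not the paper's implementation: all shapes, the number of word classes, the toy convolution and Elman-style recurrence, and the random inputs standing in for cropped mouth regions and audio features are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conv_features(frames, kernels):
    """Toy 'CNN' stage: valid 2-D cross-correlation with each kernel,
    global-average-pooled to one scalar feature per kernel per frame."""
    feats = []
    for k in kernels:
        kh, kw = k.shape
        h, w = frames.shape[1] - kh + 1, frames.shape[2] - kw + 1
        out = np.empty((frames.shape[0], h, w))
        for t, f in enumerate(frames):
            for i in range(h):
                for j in range(w):
                    out[t, i, j] = np.sum(f[i:i + kh, j:j + kw] * k)
        feats.append(out.mean(axis=(1, 2)))
    return np.stack(feats, axis=1)            # shape (T, n_kernels)

def rnn_encode(seq, Wx, Wh):
    """Toy Elman RNN: tanh recurrence, returns the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

# Hypothetical inputs: T video frames of a 16x16 mouth region,
# and T matching 13-dim audio feature vectors (e.g. MFCC-like).
T = 12
lip_frames = rng.standard_normal((T, 16, 16))
audio_feats = rng.standard_normal((T, 13))

# Visual branch: CNN features per frame, then RNN over time.
kernels = rng.standard_normal((4, 3, 3))
v_seq = conv_features(lip_frames, kernels)    # (T, 4)
H = 8
h_vis = rnn_encode(v_seq, 0.1 * rng.standard_normal((H, 4)),
                   0.1 * rng.standard_normal((H, H)))

# Audio branch: RNN directly over the audio feature sequence.
h_aud = rnn_encode(audio_feats, 0.1 * rng.standard_normal((H, 13)),
                   0.1 * rng.standard_normal((H, H)))

# Feature-level fusion: concatenate the two encodings, then classify
# over a hypothetical vocabulary of 5 words.
fused = np.concatenate([h_vis, h_aud])        # (2H,)
W_out = 0.1 * rng.standard_normal((5, 2 * H))
probs = softmax(W_out @ fused)                # (5,) class distribution
```

Because the audio branch is encoded independently, the same fused classifier degrades gracefully when one modality is noisy, which is the robustness argument the abstract makes for combining the two streams.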
Visual Speech Recognition, Lip Movement Analysis, Multimodal Speech Recognition, Deep Learning, Convolutional Neural Networks (CNN)
