
Visual Speech Recognition (VSR) is a rapidly evolving field with diverse applications in human-computer interaction, accessibility, and security. This paper presents an approach to VSR centered on the extraction and analysis of lip movements for speech recognition. Traditional speech recognition systems rely primarily on acoustic information, which makes them vulnerable to noisy environments and audio disturbances. In contrast, the proposed method leverages the visual modality by harnessing the rich information encoded in lip movements during speech production. The study begins by collecting a comprehensive dataset of visual and audio recordings of speech across multiple languages and contexts. A deep learning architecture is then designed to process the visual data, with emphasis on lip movements, alongside the corresponding audio data. The proposed model integrates convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract and fuse information from both modalities; this fusion mitigates the limitations of traditional audio-only speech recognition and improves robustness. We evaluate the visual speech recognition system on a range of benchmark datasets and real-world scenarios. The results demonstrate the efficacy of the approach, highlighting its capacity to improve recognition accuracy, particularly in noisy environments or in situations where audio data is incomplete or unavailable. In conclusion, this research contributes to the advancement of Visual Speech Recognition by introducing an approach that emphasizes lip-movement analysis. By leveraging both audio and visual modalities, the proposed system provides a more robust and versatile solution for speech recognition, with the potential to enhance applications in human-computer interaction, accessibility, and security.
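The abstract describes a pipeline in which a CNN extracts per-frame features from the visual (lip) stream, RNNs encode both the visual and audio sequences, and the two representations are fused before classification. The sketch below illustrates that general CNN–RNN fusion pattern in miniature using NumPy. It is not the paper's implementation: all shapes, the number of word classes, the toy convolution and Elman-style recurrence, and the random inputs standing in for cropped mouth regions and audio features are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conv_features(frames, kernels):
    """Toy 'CNN' stage: valid 2-D cross-correlation with each kernel,
    global-average-pooled to one scalar feature per kernel per frame."""
    feats = []
    for k in kernels:
        kh, kw = k.shape
        h, w = frames.shape[1] - kh + 1, frames.shape[2] - kw + 1
        out = np.empty((frames.shape[0], h, w))
        for t, f in enumerate(frames):
            for i in range(h):
                for j in range(w):
                    out[t, i, j] = np.sum(f[i:i + kh, j:j + kw] * k)
        feats.append(out.mean(axis=(1, 2)))
    return np.stack(feats, axis=1)            # shape (T, n_kernels)

def rnn_encode(seq, Wx, Wh):
    """Toy Elman RNN: tanh recurrence, returns the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

# Hypothetical inputs: T video frames of a 16x16 mouth region,
# and T matching 13-dim audio feature vectors (e.g. MFCC-like).
T = 12
lip_frames = rng.standard_normal((T, 16, 16))
audio_feats = rng.standard_normal((T, 13))

# Visual branch: CNN features per frame, then RNN over time.
kernels = rng.standard_normal((4, 3, 3))
v_seq = conv_features(lip_frames, kernels)    # (T, 4)
H = 8
h_vis = rnn_encode(v_seq, 0.1 * rng.standard_normal((H, 4)),
                   0.1 * rng.standard_normal((H, H)))

# Audio branch: RNN directly over the audio feature sequence.
h_aud = rnn_encode(audio_feats, 0.1 * rng.standard_normal((H, 13)),
                   0.1 * rng.standard_normal((H, H)))

# Feature-level fusion: concatenate the two encodings, then classify
# over a hypothetical vocabulary of 5 words.
fused = np.concatenate([h_vis, h_aud])        # (2H,)
W_out = 0.1 * rng.standard_normal((5, 2 * H))
probs = softmax(W_out @ fused)                # (5,) class distribution
```

Because the audio branch is encoded independently, the same fused classifier degrades gracefully when one modality is noisy, which is the robustness argument the abstract makes for combining the two streams.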
Visual Speech Recognition, Lip Movement Analysis, Multimodal Speech Recognition, Deep Learning, Convolutional Neural Networks (CNN)
