
Abstract. Gesture recognition remains a critical challenge in human-computer interaction due to issues such as lighting variations, background noise, and limited annotated datasets, particularly for underrepresented sign languages. To address these limitations, we propose G-MAE (Gesture-aware Masked Autoencoder), a self-supervised framework built on a Gesture-aware Multi-Scale Transformer (GMST) backbone that integrates multi-scale dilated convolutions (MSDC), multi-head self-attention (MHSA), and a multi-scale contextual feedforward network (MSC-FFN) to capture both local and long-range spatiotemporal dependencies. Pre-trained on the Slovo corpus with a 50–70% masking ratio and fine-tuned on TheRusLan, G-MAE achieves 94.48% accuracy, and ablation studies confirm the contribution of each component: removing MSDC, MSC-FFN, or MHSA reduces accuracy to 92.67%, 91.95%, and 90.54%, respectively. The 50–70% masking ratio proves optimal, balancing information retention and learning efficiency, and G-MAE remains robust even with limited labeled data, advancing gesture recognition in resource-constrained scenarios.
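To make the described architecture concrete, the following is a minimal, hypothetical sketch of a GMST-style encoder block in PyTorch, assuming a (batch, time, dim) token layout. The abstract does not specify kernel sizes, dilation rates, hidden widths, or normalization placement, so all of those choices (and the class/parameter names `MSDC`, `MSCFFN`, `GMSTBlock`, `dilations`, `hidden`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical GMST-style block: MSDC for local cues, MHSA for long-range ones,
# and an MSC-FFN that adds a multi-scale contextual branch to the feed-forward path.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class MSDC(nn.Module):
    """Multi-scale dilated depthwise convolutions over the temporal axis (assumed design)."""

    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        )
        self.proj = nn.Linear(dim * len(dilations), dim)

    def forward(self, x):                       # x: (B, T, D)
        y = x.transpose(1, 2)                   # (B, D, T) for Conv1d
        y = torch.cat([b(y) for b in self.branches], dim=1)
        return self.proj(y.transpose(1, 2))     # back to (B, T, D)


class MSCFFN(nn.Module):
    """Feed-forward network with an added multi-scale contextual branch (assumed design)."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim * expansion), nn.GELU(),
                                 nn.Linear(dim * expansion, dim))
        self.context = MSDC(dim)

    def forward(self, x):
        return self.ffn(x) + self.context(x)


class GMSTBlock(nn.Module):
    """One encoder block combining MSDC, MHSA, and MSC-FFN with residual connections."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.msdc = MSDC(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = MSCFFN(dim)
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, T, D) gesture tokens
        x = x + self.msdc(self.n1(x))           # local spatiotemporal dependencies
        h = self.n2(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # long-range dependencies
        return x + self.ffn(self.n3(x))


# MAE-style pre-training would feed only the visible tokens to the encoder;
# dropping half of them here mimics the lower end of the 50-70% masking range.
tokens = torch.randn(2, 32, 256)                # e.g. 32 frame tokens per clip
visible = tokens[:, ::2]                        # keep ~50% of tokens
out = GMSTBlock()(visible)
print(out.shape)                                # torch.Size([2, 16, 256])
```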
