
Semantic Textual Similarity (STS) is one of the fundamental tasks of Natural Language Processing (NLP). Because Uzbek is a morphologically rich language with few large-scale annotated datasets, STS remains a significant challenge for researchers. Standard Transformer-based cross-encoders offer high accuracy but are computationally prohibitive for large-scale applications, whereas bi-encoders are fast but require substantial training data to perform well. In this paper, we introduce AugSBERT-Uz, a novel semi-supervised framework that produces a state-of-the-art sentence embedding model for the Uzbek language. The framework employs a “teacher-student” knowledge distillation approach. First, a high-accuracy cross-encoder (the “teacher”), based on the monolingual BERTbek model, is fine-tuned on a small, human-annotated “gold” dataset. This teacher model is then used to automatically label millions of sentence pairs from a large unlabeled corpus, producing a vast “silver-standard” dataset. Finally, a bi-encoder (the “student”) with a Siamese architecture is trained on this augmented dataset using Multiple Negatives Ranking Loss. The proposed framework enables the bi-encoder to achieve performance remarkably close to that of the high-accuracy cross-encoder, with a Spearman correlation of 83.2, while retaining its computational efficiency (an inference response time of 5 seconds), making it suitable for large-scale semantic search and clustering tasks. The method effectively bridges the performance gap caused by data scarcity, yielding a model that is both accurate and scalable. AugSBERT-Uz thus offers a scalable solution for building high-quality semantic representations for low-resource, agglutinative languages. This work provides the first high-performance, publicly available sentence embedding model for Uzbek, paving the way for advancements in regional NLP applications.
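To make the last two stages of the pipeline concrete, the following is a minimal, library-free sketch rather than the authors' implementation. It assumes that silver pairs are turned into positives by thresholding the teacher score (the abstract does not say how pairs are selected), and that the loss is the usual in-batch formulation of Multiple Negatives Ranking Loss: cross-entropy over scaled cosine similarities, where every other positive in the batch serves as a negative. The names `select_silver_positives` and `teacher_score`, the `threshold` of 0.8, and the `scale` of 20.0 are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_silver_positives(pairs, teacher_score, threshold=0.8):
    """Keep the sentence pairs that the teacher scores at or above `threshold`.

    `teacher_score` stands in for the fine-tuned cross-encoder; the
    thresholding rule itself is an assumption, not taken from the paper.
    """
    return [(a, b) for a, b in pairs if teacher_score(a, b) >= threshold]


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch Multiple Negatives Ranking Loss over embedding vectors.

    For anchor i, positives[i] is the correct match and every other
    positives[j] acts as a negative. The loss is the mean cross-entropy
    of the scaled cosine similarities against the diagonal.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

In this sketch the loss is near zero when each anchor embedding is closest to its own positive and grows as other in-batch positives become more similar, which is why the student only needs positive pairs from the silver dataset rather than graded scores.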
