A SEMI-SUPERVISED FRAMEWORK NAMED AUGSBERT-UZ FOR HIGH-PERFORMANCE SEMANTIC TEXTUAL SIMILARITY IN UZBEK

Semantic Textual Similarity (STS) is one of the fundamental tasks of Natural Language Processing (NLP). Because Uzbek is a morphologically rich language with few large-scale annotated datasets, STS remains a significant challenge for researchers. Standard Transformer-based cross-encoders offer high accuracy but are computationally prohibitive for large-scale applications, whereas bi-encoders are fast but require substantial training data to perform well. In this paper, we introduce AugSBERT-Uz, a novel semi-supervised framework that produces a state-of-the-art sentence embedding model for the Uzbek language. The framework employs a “teacher-student” knowledge distillation approach. First, a high-accuracy cross-encoder (the “teacher”), based on the monolingual BERTbek model, is fine-tuned on a small, human-annotated “gold” dataset. This teacher model is then used to automatically label millions of sentence pairs from a large unlabeled corpus, producing a vast “silver-standard” dataset. Finally, a bi-encoder (the “student”) with a Siamese architecture is trained on this augmented dataset using Multiple Negatives Ranking Loss. The proposed framework enables the bi-encoder to reach a Spearman correlation of 83.2, remarkably close to the high-accuracy cross-encoder, while retaining its computational efficiency (inference response time of 5 seconds), making it suitable for large-scale semantic search and clustering tasks. The method thus narrows the performance gap caused by data scarcity and yields a model that is both accurate and scalable, offering an approach to building high-quality semantic representations for low-resource, agglutinative languages. This work provides the first high-performance, publicly available sentence embedding model for Uzbek, paving the way for advancements in regional NLP applications.
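To make the student's training objective concrete, the following is a minimal, dependency-free sketch of Multiple Negatives Ranking Loss over a batch of sentence-pair embeddings. It is an illustration only, not the paper's implementation: the function names, the toy 2-dimensional vectors, and the similarity scale of 20 are assumptions chosen for the example. Each anchor is scored against every positive in the batch by scaled cosine similarity, and the loss is the cross-entropy of picking its own partner, so the other positives act as in-batch negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    `scale` multiplies the cosine similarities before the softmax; the value
    20.0 is an assumed, commonly used setting, not one taken from the paper.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability of the true partner (index i).
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical values, standing in for bi-encoder outputs).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor is closest to its own partner
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is closest to the wrong partner

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(f"aligned: {loss_aligned:.4f}  swapped: {loss_swapped:.4f}")
```

The loss is near zero when every anchor is most similar to its own partner and grows when another sentence in the batch scores higher. Because the objective consumes positive pairs only, applying it to teacher-scored silver data presumably requires selecting the pairs the teacher rates as similar; that selection step is an assumption here and is not described in the abstract.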
