ZENODO
Article . 2025
License: CC BY
Data sources: ZENODO, Datacite

A SEMI-SUPERVISED FRAMEWORK NAMED AUGSBERT-UZ FOR HIGH-PERFORMANCE SEMANTIC TEXTUAL SIMILARITY IN UZBEK

Authors: B.B. Muminov, N.M. Allaberganova

Abstract

Semantic Textual Similarity (STS) is one of the fundamental tasks of Natural Language Processing (NLP). Because Uzbek is a morphologically rich language with few large-scale annotated datasets, STS remains a significant challenge for researchers. Standard Transformer-based cross-encoders offer high accuracy but are computationally prohibitive for large-scale applications, whereas bi-encoders are fast but require substantial training data to perform well. In this paper, we introduce AugSBERT-Uz, a novel semi-supervised framework that produces a state-of-the-art sentence embedding model for the Uzbek language. The framework employs a “teacher-student” knowledge distillation approach. First, a high-accuracy cross-encoder (the “teacher”), based on the monolingual BERTbek model, is fine-tuned on a small, human-annotated “gold” dataset. This teacher model is then used to automatically label millions of sentence pairs from a large unlabeled corpus, producing a vast “silver-standard” dataset. Finally, a bi-encoder (the “student”) with a Siamese architecture is trained on this augmented dataset using Multiple Negatives Ranking Loss. The proposed framework enables the bi-encoder to achieve performance remarkably close to that of the high-accuracy cross-encoder, with a Spearman correlation of 83.2, while retaining its computational efficiency (inference response time: 5 seconds), making it suitable for large-scale semantic search and clustering tasks. This method effectively bridges the performance gap caused by data scarcity, yielding a model that is both accurate and scalable. AugSBERT-Uz presents a novel and scalable solution for developing high-quality semantic representations for low-resource, agglutinative languages. This work provides the first high-performance, publicly available sentence embedding model for Uzbek, paving the way for advancements in regional NLP applications.
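
The student's training objective, Multiple Negatives Ranking Loss, can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation: the function names, the `scale` value of 20, and the toy two-dimensional vectors are all assumptions made for illustration. The idea is that, within a batch, each anchor sentence should be more similar to its own paired sentence than to every other paired sentence in the batch, which is expressed as a cross-entropy over scaled cosine similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) embedding pairs.

    positives[i] is treated as the correct match for anchors[i]; every other
    positives[j] in the batch serves as an in-batch negative.
    `scale` multiplies the cosine similarities (an assumed value).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): each anchor points roughly the same way
# as its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = multiple_negatives_ranking_loss(anchors, positives)
# Swapping the positives pairs each anchor with the wrong sentence.
shuffled_loss = multiple_negatives_ranking_loss(anchors, positives[::-1])
```

In this toy batch, `aligned_loss` is close to zero and `shuffled_loss` is much larger, which is the behaviour the objective rewards. In a setup like the one the abstract describes, the pairs fed to such a loss would presumably be drawn from the teacher-labelled silver dataset, though how those pairs are selected is not stated in the abstract.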
