Automated Speech Act Annotation in a Russian Spoken Corpus Using Large Language Models: A Comparative Study

Sherstinova, Tatiana; Firsanova, Viktoria; Novoseltseva, Alena; Megre, Mariya; Savchenko, Egor

ZENODO

Conference object . 2024

License: CC BY

Data sources: ZENODO

ZENODO

Article . 2024

License: CC BY

Data sources: Datacite

Publication: Article, Conference object; 01 Nov 2024; English; Publisher: FRUCT Oy

Authors: Sherstinova, Tatiana; Firsanova, Viktoria; Novoseltseva, Alena; Megre, Mariya; Savchenko, Egor

doi: 10.5281/zenodo.14166352, 10.5281/zenodo.14166351

Abstract

This study addresses the automatic annotation of a linguistic corpus using large language models (LLMs). Annotation is a crucial step in building a corpus, as it determines the practical scope and applications of the resulting resource. The study explores the annotation of oral speech transcripts at the pragmatic level using speech acts, which reflect the speaker's intent and purpose. Typically, this task is performed manually by experts, which greatly limits the volume of annotated data that can be produced. In this work, speech acts were annotated automatically with five LLMs commonly used for processing Russian texts: ChatGPT, GigaCHAT, YandexGPT, Mistral, and Gemini. A comparative analysis of the automatic annotation results was conducted, highlighting the strengths and weaknesses of each model. The findings suggest that employing LLMs for corpus annotation is a promising approach, with ChatGPT and Gemini proving particularly effective at speech act categorization. For Russian, however, the language-specific models GigaCHAT and YandexGPT are preferred when language-specific information is needed.
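
The comparative analysis described in the abstract could, for illustration, be approached by scoring each model's speech act labels against expert labels. The sketch below is only an assumption about how such a comparison might look: the abstract does not state which metrics the authors used, the label set (`assertive`, `directive`, `expressive`) is a hypothetical example rather than the taxonomy of the study, and `model_a` and `model_b` are placeholders with made-up labels, not results for any of the five LLMs. It computes plain accuracy and Cohen's kappa, which corrects agreement for chance.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of utterances where the model label equals the expert label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_observed = accuracy(gold, pred)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to its own label distribution.
    p_expected = sum(gold_counts[label] * pred_counts[label]
                     for label in gold_counts) / (n * n)
    if p_expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def compare_models(gold, predictions):
    """Return accuracy and kappa per model for a dict of label lists."""
    return {
        name: {"accuracy": accuracy(gold, pred),
               "kappa": cohen_kappa(gold, pred)}
        for name, pred in predictions.items()
    }


# Hypothetical expert labels and model outputs (illustrative only).
gold_labels = ["assertive", "directive", "assertive", "expressive"]
model_outputs = {
    "model_a": ["assertive", "directive", "assertive", "expressive"],
    "model_b": ["assertive", "assertive", "assertive", "expressive"],
}

scores = compare_models(gold_labels, model_outputs)
for name, result in scores.items():
    print(name, round(result["accuracy"], 3), round(result["kappa"], 3))
```

Kappa is included alongside accuracy because speech act distributions in spoken corpora tend to be skewed, so a model that overuses the most frequent label can look better on raw accuracy than it is.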

Related Organizations

University and State Library of Saxony Anhalt
Germany

Keywords

spoken speech; pragmatics; corpus linguistics; speech acts; pragmatic annotation; LLMs
