Authdetect: Model for Detecting Authoritarian Discourse in Political Speeches

This is the official replication repository for the paper Chasing the Authoritarian Spectre: Detecting Authoritarian Discourse with Large Language Models, published in the European Journal of Political Research. It contains the raw datasets for training and validating authdetect, the replication scripts, a quick walkthrough of the model (YouTube tutorial), and a complete Jupyter notebook for using the model on users' own data in Google Colab.

authdetect is a classification model for detecting authoritarian discourse in political speeches, leveraging a novel approach to studying latent political concepts through language modeling. Rather than relying on predefined rules or rigid definitions of authoritarian discourse, the model operates on the premise that authoritarian leaders naturally exhibit such discourse in their speech patterns. In short, the model assumes that "authoritarians talk like authoritarians," which allows it to identify authoritarian rhetoric in speech segments. Structured as a regression problem with weak supervision logic, the model classifies text segments based on their association with either authoritarian or democratic discourse. By training on speeches from both authoritarian and democratic leaders, it learns to distinguish between these two forms of political rhetoric.

The model is fine-tuned on top of the roberta-base model using 77 years of speech data from the UN General Assembly. The training design combines English transcripts of political speeches with a weak supervision setup in which the training data are annotated with the V-Dem polyarchy index (i.e., polyarchic status) as reference labels. The model is trained to predict the index value of a speech, linking the presented narratives with the virtual quality of democracy of the speaker's country (rather than with the speaker personally).

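
The weak supervision setup described above can be illustrated with a minimal sketch: each speech receives, as its regression target, the index value of the speaker's country in the year of the speech. The record layout, the function name `attach_weak_labels`, the choice to skip speeches without a matching country-year entry, and the index values are all hypothetical placeholders; none of them are taken from V-Dem or from the replication scripts.

```python
from typing import Dict, List, Tuple


def attach_weak_labels(
    speeches: List[dict],
    index_by_country_year: Dict[Tuple[str, int], float],
) -> List[dict]:
    """Attach a country-year index score to each speech as its weak label.

    Assumption: speeches with no matching country-year entry are skipped.
    """
    labeled = []
    for speech in speeches:
        key = (speech["country"], speech["year"])
        score = index_by_country_year.get(key)
        if score is None:
            continue  # no reference label available for this country-year
        # The label describes the country in that year, not the individual speaker.
        labeled.append({**speech, "label": score})
    return labeled


# Placeholder values for illustration only (not real index scores).
index_by_country_year = {("AAA", 2000): 0.85, ("BBB", 2000): 0.12}

speeches = [
    {"country": "AAA", "year": 2000, "text": "First speech."},
    {"country": "BBB", "year": 2000, "text": "Second speech."},
    {"country": "CCC", "year": 2000, "text": "No index available."},
]

labeled = attach_weak_labels(speeches, index_by_country_year)
for row in labeled:
    print(row["country"], row["year"], row["label"])
```

No text is hand-annotated in this setup: the label comes entirely from the country-level index, which is what makes the supervision "weak".
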
The corpus ensures robust temporal (1946–2022) and spatial (197 countries) representation, resulting in a well-balanced training dataset. Although the training data are domain-specific (the UN General Assembly), the model trained on the UNGD corpus appears to be robust across various sub-domains, demonstrating its capacity to scale well across regions and contexts. Rather than using whole speeches as training input, the model uses a sliding window of sentence trigrams that splits the raw transcripts into uniform snippets of text mapping the political language of world leaders. As the goal is to model the varying context of the ideas presented in the analyzed speeches rather than the context of the UN General Assembly debates, the main focus is on the particularities of the language of the reference groups (authoritarian/democratic leaders). The final dataset contains 1,062,286 sentence trigrams annotated with EDI (Electoral Democracy Index, i.e., the V-Dem polyarchy index) scores inherited from the parent documents (μ = 0.430, 95% CI [0.429, 0.430]).

Video tutorial

The official repository includes a walkthrough tutorial that demonstrates how to use the authdetect model on users' own data. By downloading the interactive Jupyter notebook and the sample data (how_to_use_authdetect.ipynb, sample_data.csv), anyone can follow the step-by-step instructions and run the pipeline in Google Colab. The whole process can also be followed in a tutorial video available at: https://www.youtube.com/watch?v=CRy9uxMChoE.

NOTE on v1.01: As of June 1, 2025, the original tutorial using the trankit library no longer works due to broken dependencies that cannot be resolved within the same Colab session. As a workaround, the trankit library has been replaced with the stanza toolkit (how_to_use_authdetect_w_stanza.ipynb).
Stanza performs the same functions as trankit and does not have the same dependency compatibility issues. This is the recommended pipeline for Google Colab and serves as a functional alternative to trankit, if needed.

Hugging Face

The model is also uploaded on Hugging Face, where users can easily download it and take advantage of the existing support for seamless implementation. The Zenodo repository contains the model solely for archival purposes related to the paper. Additionally, the Hugging Face archive includes a minimalistic example demonstrating the model's application. For a complete pipeline, users are encouraged to use the interactive notebook and watch the tutorial available on YouTube. You can explore the repository at: https://huggingface.co/mmochtak/authdetect.

If you use the repository, please cite:

@article{mochtak_chasing_2024,
  title = {Chasing the authoritarian spectre: {Detecting} authoritarian discourse with large language models},
  issn = {1475-6765},
  shorttitle = {Chasing the authoritarian spectre},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-6765.12740},
  doi = {10.1111/1475-6765.12740},
  journal = {European Journal of Political Research},
  author = {Mochtak, Michal},
  keywords = {authoritarian discourse, deep learning, detecting authoritarianism, model, political discourse},
}
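
The sliding window of sentence trigrams described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the replication scripts: it assumes the transcript has already been split into sentences (the notebooks use trankit or stanza for that step), that the window advances one sentence at a time, and that speeches shorter than three sentences yield no segments. The function name `sentence_trigrams` is hypothetical.

```python
from typing import List, Tuple


def sentence_trigrams(
    sentences: List[str],
    score: float,
    size: int = 3,
) -> List[Tuple[str, float]]:
    """Split a speech into overlapping windows of `size` consecutive sentences.

    Each window inherits the score of its parent document.
    Assumptions: stride of one sentence; documents shorter than `size` are dropped.
    """
    if len(sentences) < size:
        return []
    return [
        (" ".join(sentences[i : i + size]), score)
        for i in range(len(sentences) - size + 1)
    ]


# Toy input; the score is a placeholder, not a real index value.
speech_sentences = ["A.", "B.", "C.", "D."]
segments = sentence_trigrams(speech_sentences, score=0.43)
for text, label in segments:
    print(label, text)
```

With a stride of one, a speech of n sentences (n ≥ 3) produces n − 2 segments, so every sentence except those at the edges appears in three different contexts.
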

Keywords

LLM, authoritarianism, machine learning, political speeches, RoBERTa, language modeling
