Preprint · Data source: ZENODO

When Languages Are Invisible to AI: Cross-Lingual Affective State Detection for Low-Resource Languages (Maithili & Bhojpuri)

Author: Abhimanyu Prasad

Abstract

Over 100 million speakers of Maithili and Bhojpuri, two linguistically rich languages of the Indo-Aryan family spoken across Bihar, Jharkhand, and Uttar Pradesh, remain almost entirely invisible to modern natural language processing (NLP) systems. While transformer-based sentiment analysis has achieved near-human performance in English, we demonstrate that state-of-the-art monolingual English models collapse to random-chance performance (~33%) when applied to Maithili text, not through stochastic misclassification but through a systematic failure mode we term class collapse: the model produces NEUTRAL predictions for every input regardless of true sentiment polarity. Through attention-weight interpretability analysis, we reveal the precise mechanism: English BERT converts Devanagari script into [UNK] tokens, receiving zero semantic signal and defaulting to its learned neutral prior.

We present the first systematic cross-lingual affective state detection study across English, Hindi, Maithili, and Bhojpuri, introducing two original annotated corpora totalling over 73,000 examples. Our four key findings are: (1) multilingual pre-training (XLM-RoBERTa) recovers 35.3 percentage points over English BERT through script knowledge alone, with zero task-specific data; (2) native fine-tuning on as few as 3,563 carefully curated examples achieves 82.44% accuracy (F1 = 0.825), within 2.14 percentage points of the English ceiling of 84.58%; (3) a previously undocumented asymmetric transfer phenomenon exists between Maithili and Bhojpuri: transfer from Maithili to Bhojpuri (75.00%) substantially exceeds the reverse (47.33%), a 27.67 percentage-point gap attributable to differential orthographic standardisation and code-switching rates; and (4) attention analysis reveals the token-level mechanism of failure, demonstrating that fine-tuned models genuinely attend to negation markers and affect-bearing words rather than memorising surface patterns.

All datasets, trained model checkpoints, training notebooks, cross-evaluation scripts, and attention visualisations are publicly released at https://huggingface.co/abhiprd20.
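
The [UNK] mechanism described above is straightforward to check directly. The sketch below tokenises a Devanagari sentence with an English-only WordPiece tokenizer and with XLM-RoBERTa's SentencePiece tokenizer, and reports how many tokens map to the unknown ID. The checkpoint names (bert-base-uncased, xlm-roberta-base) and the sample sentence are illustrative assumptions, not the paper's released artefacts.

    # Sketch: measure the unknown-token rate for Devanagari input.
    # Checkpoint names and the sample sentence are illustrative assumptions.
    from transformers import AutoTokenizer

    text = "ई फिलिम बहुत नीक अछि"  # illustrative Devanagari sentence

    for name in ("bert-base-uncased", "xlm-roberta-base"):
        tok = AutoTokenizer.from_pretrained(name)
        ids = tok(text, add_special_tokens=False)["input_ids"]
        unk = sum(i == tok.unk_token_id for i in ids)
        print(f"{name}: {len(ids)} tokens, {unk} mapped to {tok.unk_token}")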
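
Class collapse itself can be diagnosed without model internals: on a balanced three-way test set, accuracy near 33% combined with one label absorbing nearly all predictions indicates a degenerate prior rather than noisy errors. A minimal, self-contained sketch of that check follows; the helper name and the toy prediction list are hypothetical.

    # Sketch: flag class collapse from a classifier's predicted labels.
    from collections import Counter

    def collapse_report(predicted_labels):
        """Return the majority predicted label and its share; a share near
        100% on a balanced test set signals class collapse, not noise."""
        label, count = Counter(predicted_labels).most_common(1)[0]
        return label, count / len(predicted_labels)

    # Toy predictions standing in for a real classifier's output:
    label, share = collapse_report(["NEUTRAL"] * 97 + ["POSITIVE", "NEGATIVE", "NEUTRAL"])
    print(f"majority label {label} covers {share:.0%} of predictions")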
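
The attention analysis behind finding (4) can be reproduced in outline with the standard output_attentions flag in transformers. The sketch below averages attention over layers and heads and prints how much mass the classification position places on each token, including the negation marker; the base checkpoint stands in for the authors' fine-tuned models, and the Maithili-style sentence is illustrative.

    # Sketch: inspect attention from the classification position to each token.
    # xlm-roberta-base stands in for a fine-tuned checkpoint (assumption); its
    # classification head is randomly initialised, so only attention is meaningful.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "xlm-roberta-base"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)
    model.eval()

    enc = tok("ई फिलिम नीक नइ लागल", return_tensors="pt")  # illustrative sentence with negation
    with torch.no_grad():
        out = model(**enc)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    att = torch.stack(out.attentions).mean(dim=(0, 2))  # average over layers and heads
    cls_row = att[0, 0]                                 # attention from position 0 (<s>)
    for token, weight in zip(tok.convert_ids_to_tokens(enc["input_ids"][0].tolist()), cls_row):
        print(f"{token:>12s}  {weight.item():.3f}")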
