
This dataset contains the subs2vec embeddings for Telugu, as presented in https://zenodo.org/records/17243814. The embeddings were trained on large-scale subtitle corpora and represent semantic vector spaces derived from naturalistic language use in films and television, drawn from the OpenSubtitles 2018 dataset: https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles.

For this language, we provide all embedding variants explored in the study. Specifically, the dataset includes vectors generated under different combinations of:

Dimensionality: multiple vector sizes (e.g., 100, 200, 300, …)
Window size: varying context windows (e.g., 2, 5, 10, …)

Each file corresponds to a unique configuration (dimension × window size). The first column of each file holds the vocabulary for the language, and columns 2 through (dimension size + 1) hold the embedding values.

If you use this dataset, please cite:

Manuscript: https://doi.org/10.5281/zenodo.17243812
Data: This Zenodo dataset (using the DOI provided here)
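The file layout described above (word in column 1, embedding values in the remaining columns) could be read along the following lines. This is a minimal sketch rather than an official loader: the function names `load_embeddings` and `cosine` are hypothetical, the whitespace delimiter and the absence of a header row are assumptions that should be checked against the actual files, and the three toy rows are made-up stand-ins for real vocabulary entries.

```python
import io
import math


def load_embeddings(handle, delimiter=None, has_header=False):
    """Read rows of `word v1 v2 ... vD` into a dict mapping word -> list of floats.

    delimiter=None splits on any whitespace (an assumption about the files);
    pass "," or "\t" if the files turn out to use a different separator.
    """
    vectors = {}
    expected_dim = None
    for i, line in enumerate(handle):
        if has_header and i == 0:
            continue  # skip a header row, if the file has one
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if expected_dim is None:
            expected_dim = len(values)
        elif len(values) != expected_dim:
            raise ValueError(f"row {i}: expected {expected_dim} values, got {len(values)}")
        vectors[word] = values
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data in the assumed layout (3 dimensions, so 4 columns per row).
sample = io.StringIO(
    "alpha 1.0 0.0 0.0\n"
    "beta 0.0 1.0 0.0\n"
    "gamma 1.0 1.0 0.0\n"
)
vectors = load_embeddings(sample)
similarity = cosine(vectors["alpha"], vectors["gamma"])
```

For a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points at one of the downloaded configuration files. The dimension check is a cheap way to confirm that the file matches the configuration its name advertises.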
