ZENODO
Dataset . 2024
License: CC BY
Data sources: ZENODO, Datacite

Diverge-Gemini POS-tagged Corpus of Modern Tibetan

Authors: Kyogoku, Yuki; Erhard, Franz Xaver; Barnett, Robert; Hill, Nathan

Abstract

Diverge-Gemini POS-tagged Corpus of Modern Tibetan is a modern Tibetan text corpus compiled from a wide range of modern Tibetan sources, including Tibetan-language books and newspapers from the 1950s, 1960s and 2000s published in the Republic of India and the People's Republic of China. The corpus was automatically Part-of-Speech (POS)-tagged with Google's Gemini Pro 1.5 model via the Google Cloud API, using Universal Dependencies (UD) tags. Tagging was done using the Divergent Discourses Gemini Pro 1.5 POS-tagger. To avoid arbitrary tokenisation, the raw data was tokenised with the Modern_Botok dialect pack for Botok v3.13 before Gemini POS-tagging.

The files are in CoNLL-U format.
Diverge-Gemini POS-tagged Modern Tibetan Corpus.zip contains the raw files as returned from Gemini Pro 1.5.
Diverge-Gemini POS-tagged Modern Tibetan Corpus Normalised.zip contains a set of cleaned-up and normalised files.

The following sources were used:

(I) Books:
Ma hphung 馬烽. 1954. Gnyen bsgrigs kyi lo rgyus. Pe cin: Mi rigs dpe skrun khang (50kb).
Hri hphun 石峯. 1955. Krung go'i mi dmangs rang 'thad dmag gi dmag mi phal ba zhig gi lo rgyus: Mtsho sngon mi dmangs dpe skrun khang (95kb).
Le'u hro'o chi 劉少奇. 1950. Le'u hro'o chi'i lnga gcig gtams bshad. Pe cin: Krung dbyang mi dmangs srid gzhung mi rigs don byed u yon lhan khang (93kb).
Lin khru'u 林初. 1955. Deng dus kyi the wan. Pe cin: Mi rigs dpe skrun khang (131kb).
Ma'o tse tung 毛澤東. 1952. Dmangs gtso'i ring lugs gsar pa'i bstan bcos. Pe cing: Krung dbyang mi dmangs srid gzhung mi rigs don byed u yon lhan khang (383kb).
Hu yun 胡芸. 1957. Mes rgyal gyi yul ljongs. Pe cin: Mi rigs dpe skrun khang (162kb).
Nyi zla skar gsum. 1955. Pe cin: Mi rigs dpe skrun khang (45kb).

(II) Newspapers:
(1) transcribed by Divergent Discourses:
bod mi'i rang dbang (India, 13 issues of 1965, 666kb)
dar mdo'i gsar 'gyur (PRC, ten issues from 1954-55, 1MB)
dkar mdzes nyin re'i gsar 'gyur (PRC, 10 issues from 1959, 672kb)
gsar 'gyur mdor bsdus (PRC, 16 issues from the years 1953-1954, 895kb)
kan lho'i gsar 'gyur (PRC, 12 issues from 195, 517kb9)
min ciang gsar 'gyur (PRC, nine issues from 1953-59, 783kb)
mtsho sngon bod yig gsar 'gyur (PRC, 14 issues from the years 1951-1965, 1.2MB)
Rang dbang gsar shog (India, seven issues from 1961-1965, 594kb)
Rang dbang srung skyob gsar shog (India, five issues from 1963-65, 226kb)
yul phyogs so so'i gsar 'gyur me long (India, 12 issues from the years 1950-63, 938kb)
(2) scraped from the internet:
(2.1) Esukhia Tibetan news corpus (from India):
Bangchen (12,542 articles, 121MB)
BOD Asia (161MB)
Gyalwa Rinpoche (575 articles, 9.4MB)
Radio Free Asia (26,890 articles, 117MB)
Tibet Times (18,090 articles, 218MB)
Voice of America (VOA) Tibetan (1,100 articles, 13MB)
Voice of Tibet (VOT) (7,452 articles, 68MB)
(2.2) from the PRC:
bod ljongs nyin re'i tshags par (4,155 articles, 162MB)

The corpus was tagged for the Divergent Discourses project led by Franz Xaver Erhard (Leipzig University) and Robert Barnett (SOAS, London).
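
Since the files are distributed in CoNLL-U format with UD tags, a reader may want a quick way to inspect them. The following is a minimal sketch in Python, not part of the dataset or of the Divergent Discourses tooling. It assumes the standard ten-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, ...); the raw Gemini output may deviate from that layout, so malformed lines are simply skipped. The function names `read_conllu` and `upos_counts` and the sample sentence are hypothetical illustrations and are not taken from the corpus.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (FORM, UPOS)


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of (FORM, UPOS) pairs.

    Assumes standard CoNLL-U: tab-separated columns, '#' comment lines,
    and a blank line between sentences.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # malformed line: no UPOS column
        if not cols[0].isdigit():
            continue  # multiword-token range (e.g. 1-2) or empty node (e.g. 1.1)
        sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence  # file did not end with a blank line


def upos_counts(sentences: Iterable[List[Token]]) -> Counter:
    """Count how often each UPOS tag occurs across all sentences."""
    return Counter(upos for sent in sentences for _, upos in sent)


# Invented sample for illustration only; not a line from the corpus.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tང\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tབོད་པ\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\tཡིན\t_\tAUX\t_\t_\t_\t_\t_\t_\n"
    "4\t།\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
counts = upos_counts(sentences)
print(counts.most_common())
```

To run the same functions on a file extracted from one of the archives, pass an open file handle instead of the sample, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` is whichever file you have unpacked.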
