178 research outcomes, page 1 of 18
  • other research product . lexicalConceptualResource . 2021
    Open Access Slovenian
    Authors:
    Krek, Simon; Gantar, Apolonija; Laskowski, Cyprian; Krsnik, Luka; Kosem, Iztok; Brank, Janez; Dobrovoljc, Kaja; Arhar Holdt, Špela; Čibej, Jaka; Robnik-Šikonja, Marko; ...
    Publisher: Centre for Language Resources and Technologies, University of Ljubljana

    The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized scripts for extracting data from corpora containing syntactic dependency annotations. The lexicon co...

  • other research product . corpus . 2021
    Open Access Slovenian
    Authors:
    Ahačič, Kozma; Atelšek, Simon; Erjavec, Tomaž; Holozan, Peter; Jakop, Nataša; Jemec Tomazin, Mateja; Ježovnik, Janoš; Ledinek, Nina; Perdih, Andrej; Romih, Miro; ...
    Publisher: ZRC SAZU

    The Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus of 428 short school texts written primarily by primary-school students in grades 1 to 5 between 2017 and 2020. The corpus consists of approximately 95,000 tokens and was de...

  • other research product . corpus . 2021
    Slovenian
    Authors:
    Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; Hrovat, Goran
    Publisher: Jožef Stefan Institute

    The KAS-abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 million words) from 62,000 BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic Slovene. This corpus is made available because the public version of KAS ...

  • other research product . corpus . 2021
    Open Access Slovenian
    Authors:
    Krek, Simon; Dobrovoljc, Kaja; Erjavec, Tomaž; Može, Sara; Ledinek, Nina; Holz, Nanika; Zupan, Katja; Gantar, Polona; Kuzman, Taja; Čibej, Jaka; ...
    Publisher: Centre for Language Resources and Technologies, University of Ljubljana

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities...

  • other research product . toolService . 2021
    Open Access Slovenian
    Authors:
    Ljubešić, Nikola; Krsnik, Luka
    Publisher: Jožef Stefan Institute

    The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the Sloleks inflectional lexicon (http...

  • other research product . corpus . 2021
    Open Access Slovenian
    Authors:
    Žejn, Andrejka; Erjavec, Tomaž
    Publisher: ZRC SAZU

    The PriLit corpus contains 37 texts of older Slovenian narrative prose by 12 authors. One text, Sreča v nesreči (Fortune in Misfortune) by Janez Cigler (first published in 1836), is present in 7 editions, leading to 43 texts in the corpus. The texts were published 1643 ...

  • other research product . lexicalConceptualResource . 2020
    Open Access Slovenian
    Authors:
    Čibej, Jaka; Arhar Holdt, Špela; Krek, Simon
    Publisher: Centre for Language Resources and Technologies, University of Ljubljana

    This entry consists of a TSV file containing a list of 66,347 Slovene word pairs from the Sloleks Morphological Lexicon of Slovene (v2.0; http://hdl.handle.net/11356/1230) that have been automatically identified as morphologically related according to a number of manual...

  • other research product . corpus . 2020
    Open Access Slovenian
    Authors:
    Zwitter Vitez, Ana; Zemljarič Miklavčič, Jana; Krek, Simon; Stabej, Marko; Erjavec, Tomaž; Verdonik, Darinka; Krajnc Ivič, Mira; Antloga, Špela; Majhenič, Simona
    Publisher: Centre for Language Resources and Technologies, University of Ljubljana

    The GORDAN 1.0 corpus contains authentic spoken-communication data annotated for dialogue acts. This entry contains the complete audio files of the corpus (seven WAV files, 1 hour of recording) and video files (four MP4 files). Video files are provided only f...

  • other research product . toolService . 2020
    Open Access Slovenian
    Authors:
    Ljubešić, Nikola
    Publisher: Jožef Stefan Institute

    The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and the Janes-Tag corpus (http://hdl.handle....

  • other research product . lexicalConceptualResource . 2020
    Open Access Slovenian
    Authors:
    Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
    Publisher: Centre for Language Resources and Technologies, University of Ljubljana

    Frequency lists of character-level n-grams were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain 1-5-gram combinations of characters occurri...
