Powered by OpenAIRE graph
ZENODO
Dataset . 2022
License: CC BY NC ND
Data sources: Datacite
Data sources: ZENODO

This Research product is the result of merged Research products in OpenAIRE.

CoRoLa Frequency Lists

Authors: Păiș, Vasile

Abstract

The Reference Corpus for Contemporary Romanian Language (CoRoLa) was constructed as a priority project of the Romanian Academy. It contains both written texts and oral recordings. Its aim is to cover the major functional language styles (legal, scientific, journalistic, imaginative, memoirs, administrative), in four domains (arts and culture, nature, society, science) and 71 sub-domains, while taking into account intellectual property rights (IPR). With over 1 billion word tokens (written and spoken), CoRoLa is one of the largest fully IPR-cleared reference corpora in the world. https://corola.racai.ro

This dataset contains multiple frequency lists extracted from CoRoLa. There are 12 word-based frequency lists and 12 lemma-based frequency lists. These were constructed only from tokens consisting of letters (tokens with numbers or special symbols were excluded). Lemmatization was performed automatically at corpus level using the TTL tool.

The following files are available:

  • corola_word_freq_all: frequency list for all tokens, as they appear in the corpus
  • corola_word_freq_all_nodiacritics: frequency list for all tokens, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_word_freq_all_lowercase: frequency list for all tokens, lowercased
  • corola_word_freq_all_lowercase_nodiacritics: frequency list for all tokens, lowercased and with diacritics removed
  • corola_word_freq_gte5: frequency list for tokens appearing at least 5 times in the corpus
  • corola_word_freq_gte5_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_word_freq_gte5_lowercase: frequency list for tokens appearing at least 5 times in the corpus, lowercased
  • corola_word_freq_gte5_lowercase_nodiacritics: frequency list for tokens appearing at least 5 times in the corpus, lowercased and with diacritics removed
  • corola_word_freq_gte10: frequency list for tokens appearing at least 10 times in the corpus
  • corola_word_freq_gte10_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_word_freq_gte10_lowercase: frequency list for tokens appearing at least 10 times in the corpus, lowercased
  • corola_word_freq_gte10_lowercase_nodiacritics: frequency list for tokens appearing at least 10 times in the corpus, lowercased and with diacritics removed
  • corola_lemma_freq_all: frequency list for all lemmas, as they appear in the corpus
  • corola_lemma_freq_all_nodiacritics: frequency list for all lemmas, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_lemma_freq_all_lowercase: frequency list for all lemmas, lowercased
  • corola_lemma_freq_all_lowercase_nodiacritics: frequency list for all lemmas, lowercased and with diacritics removed
  • corola_lemma_freq_gte5: frequency list for lemmas appearing at least 5 times in the corpus
  • corola_lemma_freq_gte5_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_lemma_freq_gte5_lowercase: frequency list for lemmas appearing at least 5 times in the corpus, lowercased
  • corola_lemma_freq_gte5_lowercase_nodiacritics: frequency list for lemmas appearing at least 5 times in the corpus, lowercased and with diacritics removed
  • corola_lemma_freq_gte10: frequency list for lemmas appearing at least 10 times in the corpus
  • corola_lemma_freq_gte10_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, with diacritics removed (replaced with the corresponding ASCII letters)
  • corola_lemma_freq_gte10_lowercase: frequency list for lemmas appearing at least 10 times in the corpus, lowercased
  • corola_lemma_freq_gte10_lowercase_nodiacritics: frequency list for lemmas appearing at least 10 times in the corpus, lowercased and with diacritics removed

Number of entries in each of the released files:

File                                              # Entries
corola_lemma_freq_all_lowercase_nodiacritics      1,375,725
corola_lemma_freq_all_lowercase                   1,457,518
corola_lemma_freq_all_nodiacritics                1,562,523
corola_lemma_freq_all                             1,635,250
corola_lemma_freq_gte10_lowercase_nodiacritics      227,590
corola_lemma_freq_gte10_lowercase                   235,234
corola_lemma_freq_gte10_nodiacritics                242,325
corola_lemma_freq_gte10                             248,593
corola_lemma_freq_gte5_lowercase_nodiacritics       351,596
corola_lemma_freq_gte5_lowercase                    365,463
corola_lemma_freq_gte5_nodiacritics                 380,751
corola_lemma_freq_gte5                              392,053
corola_word_freq_all_lowercase_nodiacritics       1,685,410
corola_word_freq_all_lowercase                    1,813,746
corola_word_freq_all_nodiacritics                 2,112,107
corola_word_freq_all                              2,260,992
corola_word_freq_gte10_lowercase_nodiacritics       358,577
corola_word_freq_gte10_lowercase                    381,715
corola_word_freq_gte10_nodiacritics                 447,538
corola_word_freq_gte10                              473,087
corola_word_freq_gte5_lowercase_nodiacritics        517,630
corola_word_freq_gte5_lowercase                     553,031
corola_word_freq_gte5_nodiacritics                  650,971
corola_word_freq_gte5                               690,676
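
The relationship between the list variants described above (letters-only tokens, lowercasing, diacritics removal, minimum-count thresholds) can be illustrated with a minimal Python sketch. It is not the procedure used to build the released files: the helper names `strip_diacritics` and `derive`, the toy counts, the use of `str.isalpha` as the letters-only test, and the choice to apply the threshold after merging normalized forms are all assumptions made for illustration. The sketch also works on an in-memory counter, since the on-disk format of the released files is not described here.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Replace letters carrying diacritics with their base ASCII letters."""
    # NFD splits e.g. "ă" into "a" plus a combining mark, which is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def derive(freq, lowercase=False, nodiacritics=False, min_count=1):
    """Build a derived frequency list from raw token counts (hypothetical helper)."""
    merged = Counter()
    for token, count in freq.items():
        # Assumption: "tokens consisting of letters" is approximated by str.isalpha.
        if not token.isalpha():
            continue
        if lowercase:
            token = token.lower()
        if nodiacritics:
            token = strip_diacritics(token)
        # Forms that become identical after normalization have their counts summed.
        merged[token] += count
    # Assumption: the threshold is applied after merging.
    return Counter({t: c for t, c in merged.items() if c >= min_count})


# Toy counts, invented for this example only.
toy = Counter({"Țară": 3, "țară": 4, "tara": 2, "și": 10, "Și": 1, "abc123": 7})

print(len(derive(toy)))                                     # 5 entries
print(len(derive(toy, lowercase=True)))                     # 3 entries
print(len(derive(toy, lowercase=True, nodiacritics=True)))  # 2 entries
```

Each normalization step merges forms that were previously distinct, which is consistent with the entry counts in the table shrinking from the "all" lists to their lowercase and nodiacritics variants.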

Keywords

word frequency list, CoRoLa, lemma frequency list, Representative Corpus of Contemporary Romanian Language

  • Impact by BIP!
    citations: 0
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    popularity: Average
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    influence: Average
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    impulse: Average
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
  • Usage by OpenAIRE UsageCounts
    views: 102
    downloads: 21