
handle: 10067/282070151162165141 , 1942/788
N-grams are generalized words consisting of N consecutive symbols, as they are used in a text. This paper determines the rank-frequency distribution for redundant N-grams. For entire texts this is known to be Zipf's law (i.e., an inverse power law). For N-grams, however, we show that the rank (r)-frequency distribution is $$P_N(r) = \frac{C}{(\psi_N(r))^{\beta}},$$ where $\psi_N$ is the inverse function of $f_N(x) = x \ln^{N-1} x$. Here we assume that the rank-frequency distribution of the symbols follows Zipf's law with exponent $\beta$.
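The distribution stated in the abstract can be evaluated numerically. The following is a minimal sketch, not the paper's own procedure: it assumes the inverse of f_N is taken on x ≥ 1 (where f_N is increasing), uses plain bisection to invert it, and sets the constant C to 1 by default. The function names `f_n`, `psi_n`, and `p_n` are hypothetical and chosen only for illustration.

```python
import math


def f_n(x: float, n: int) -> float:
    """f_N(x) = x * (ln x)^(N-1), as defined in the abstract."""
    return x * math.log(x) ** (n - 1)


def psi_n(r: float, n: int) -> float:
    """Numerical inverse of f_N on x >= 1 (assumed domain), via bisection."""
    if n == 1:
        # f_1(x) = x, so the inverse is the identity.
        return float(r)
    lo, hi = 1.0, 2.0
    # Grow the upper bracket until f_N(hi) >= r.
    while f_n(hi, n) < r:
        hi *= 2.0
    # Bisection: f_N is increasing on [1, inf), so the root is unique.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p_n(r: float, n: int, beta: float, c: float = 1.0) -> float:
    """P_N(r) = C / psi_N(r)^beta; c=1.0 is an illustrative placeholder."""
    return c / psi_n(r, n) ** beta


if __name__ == "__main__":
    # Print the first few ranks for single symbols (N=1) and bigrams (N=2).
    for rank in range(1, 6):
        print(rank, p_n(rank, 1, beta=1.0), p_n(rank, 2, beta=1.0))
```

For N = 1 the inverse reduces to ψ_1(r) = r, so the sketch returns the ordinary Zipf form C / r^β, which matches the assumption the abstract places on the symbols.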
N-gram; law of Zipf; rank-frequency distribution; CENTRAL-LIMIT-THEOREM; INFORMATION-RETRIEVAL; ZIPFS LAW; SIMILARITY
