
handle: 10138/356332
Tokenisation is the process of converting text into a form in which each item is separated from the rest of the text. Words, for example, are such items, and they must be separated from punctuation marks and diacritics. The most convenient way to do this is to add a space on both sides of each item. Tokenisation also applies to the diacritics and punctuation marks themselves: each of them must be delimited by spaces. The text is then easy to verticalize, that is, to write with one item per line, so that morphological analysis can be performed on each item. In rule-based language technology, we retain the words in their inflected forms, but we perform two operations on them. We rewrite the contracted word forms used in English as non-contracted forms. We also convert upper-case letters to lower case, placing an asterisk '*' in front of each converted letter so that it can later be converted back to upper case.
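The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tiny contraction table, and the choice to treat every character that is neither a letter, digit, underscore, whitespace, nor asterisk as a separate item are all assumptions made for illustration. It also assumes the input text contains no literal asterisks, and it does not attempt to handle diacritics.

```python
import re

# Hypothetical, deliberately tiny contraction table (an assumption of this
# sketch); a real system would need a much fuller list.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "won't": "will not",
}

def mark_case(text):
    """Lower-case every upper-case letter and prefix it with '*'."""
    return "".join("*" + c.lower() if c.isupper() else c for c in text)

def expand_contractions(text):
    """Rewrite contracted forms (already lower-cased) as full forms."""
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text

def tokenise(text):
    """Return the list of items: words and individual punctuation marks."""
    text = expand_contractions(mark_case(text))
    # Put a space on both sides of each punctuation character.
    text = re.sub(r"([^\w\s*])", r" \1 ", text)
    return text.split()

def verticalise(tokens):
    """One item per line, ready for per-item morphological analysis."""
    return "\n".join(tokens)

def restore_case(token):
    """Undo mark_case: turn '*x' back into 'X'."""
    return re.sub(r"\*(\w)", lambda m: m.group(1).upper(), token)

if __name__ == "__main__":
    print(verticalise(tokenise("I don't know, Anna.")))
```

Case is marked before contractions are expanded so that the contraction table only needs lower-case keys; the asterisk stays attached to the first word of the expanded form.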
Computer and information sciences, Languages
