Tokenisation in rule-based machine translation

Hurskainen Arvi

Data sources:

HELDA - Digital Repository of the University of Helsinki (Research, 2023)
Research.fi (Other literature type, 2023)

Publication type: Research, Other literature type
Date: 21 Mar 2023
Country: Finland
Language: English
Publisher: SALAMA - Swahili Language Manager

Authors: Hurskainen Arvi

handle: 10138/356332

Abstract

Tokenisation is the process of converting text into a form in which each item is separated from the rest of the text. Words, for example, are such items, and they must be separated from punctuation marks and diacritics. The most convenient way to do this is to add a space on both sides of the item. Tokenisation also applies to diacritics and punctuation marks, each of which must be separated with spaces. It is then easy to verticalize the text, so that morphological analysis can be performed on each item. In rule-based language technology, we retain the words in their inflected forms. However, we perform two operations on them. We rewrite the contracted word forms used in English into non-contracted forms. We also convert upper-case letters into lower case, placing an asterisk '*' in front of each converted letter, so that they can later be converted back to upper case.
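
The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the tiny contraction table, the order of the steps, and the assumption that the input contains no literal asterisk are all hypothetical choices made for illustration. The separation of diacritics is left out because the abstract does not say how it is done.

```python
import re

# Hypothetical, deliberately tiny contraction table (an assumption for
# illustration); a real system would need a much larger, curated list.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "we'll": "we will",
}


def expand_contractions(text):
    """Rewrite known contracted forms (ASCII apostrophe only) in full."""
    def repl(match):
        word = match.group(0)
        expanded = CONTRACTIONS.get(word.lower())
        if expanded is None:
            return word  # unknown form: leave it untouched
        # Keep an initial capital, e.g. "Don't" -> "Do not".
        if word[0].isupper():
            return expanded[0].upper() + expanded[1:]
        return expanded
    return re.sub(r"[A-Za-z]+'[A-Za-z]+", repl, text)


def mark_case(text):
    """Lower-case the text, putting '*' before each converted letter."""
    return "".join("*" + ch.lower() if ch.isupper() else ch for ch in text)


def restore_case(text):
    """Undo mark_case: '*x' becomes 'X' (assumes no literal '*' in input)."""
    return re.sub(r"\*(.)", lambda m: m.group(1).upper(), text)


def separate_punctuation(text):
    """Put a space on both sides of every character that is not a word
    character, whitespace, or the '*' case marker, then collapse spaces."""
    spaced = re.sub(r"([^\w\s*])", r" \1 ", text)
    return " ".join(spaced.split())


def tokenise(text):
    """Expand contractions, mark case, separate punctuation, split."""
    text = expand_contractions(text)
    text = mark_case(text)
    return separate_punctuation(text).split()


def verticalise(tokens):
    """One token per line, ready for per-item morphological analysis."""
    return "\n".join(tokens)


if __name__ == "__main__":
    tokens = tokenise("Don't panic, Arvi!")
    print(verticalise(tokens))
    print(restore_case(" ".join(tokens)))
```

Under these assumptions, the input "Don't panic, Arvi!" yields the tokens *do, not, panic, the comma, *arvi, and the exclamation mark, one per line after verticalisation. Contractions are expanded before case marking so that the lookup table only has to store lower-case forms.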

Country

Finland

Related Organizations

University of Helsinki
Finland

Keywords

Computer and information sciences, Languages

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open Access route: Green

Related to Research communities

UArctic