Creating Welsh Language Word Embeddings

Publication: Article, Other literature type
Date: 27 Jul 2021
Language: English
Publisher: MDPI AG
Journal: Applied Sciences, volume 11, page 6896 (eISSN: 2076-3417)

Authors: Padraig Corcoran; Geraint Palmer; Laura Arman; Dawn Knight; Irena Spasić

doi: 10.3390/app11156896


Abstract

Word embeddings are representations of words in a vector space that models semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account the syntactic and morphological idiosyncrasies of the language. These methods exploit the principles of distributional semantics and therefore require a large training corpus. However, Welsh is a minoritised language, so significantly less Welsh language data are publicly available than for English, and assembling a sufficiently large text corpus is not straightforward. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of Welsh. The relative complexity of Welsh punctuation made tokenisation of this corpus challenging, as punctuation could not be used for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and can therefore relate different surface forms more effectively during training. We conducted both qualitative and quantitative evaluations of the resulting word embeddings, which outperformed the Welsh word embeddings previously described as part of a larger study covering 157 languages. Our study was the first to focus specifically on Welsh word embeddings.
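The subword idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the character n-gram scheme generally associated with fastText (a word wrapped in `<` and `>` boundary markers and split into n-grams of length 3 to 6); the helper names `char_ngrams` and `shared_ngrams` and the example words are hypothetical illustrations, not taken from the paper, and the sketch covers n-gram extraction only, not embedding training.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal character sequences.
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def shared_ngrams(a, b, n_min=3, n_max=6):
    """Return the n-grams that two surface forms have in common."""
    return char_ngrams(a, n_min, n_max) & char_ngrams(b, n_min, n_max)
```

For example, `shared_ngrams("cath", "gath")` compares the Welsh word for "cat" with its soft-mutated form and returns the three n-grams `ath`, `th>` and `ath>`. In a subword-based model, such shared n-grams contribute to the vectors of both forms, which is one way different surface forms of the same word can end up related during training.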

Related Organizations

Cardiff University
United Kingdom

Keywords

word embeddings, Technology, QH301-705.5, T, Physics, QC1-999, Welsh language, Engineering (General). Civil engineering (General), P1, QA76, human language technology, Chemistry, machine learning, natural language processing, TA1-2040, Biology (General), QD1-999

Related research products (2)

corpuscrawler (software on GitHub), relation: IsRelatedTo
wnlt-project (software on SourceForge), relation: IsRelatedTo

Impact (by BIP!)

selected citations: 4
popularity: Top 10%
influence: Average
impulse: Average