
In this work we evaluate two different methods for deriving features for subject classification of text documents. The first method uses the standard Bag-of-Words (BoW) approach, which represents a document as a vector of frequencies of selected terms appearing in it. This method relies heavily on natural language processing (NLP) tools to preprocess the text in a grammar- and inflection-aware way. The second approach is based on the word-embedding technique recently proposed by Mikolov et al. and requires no NLP preprocessing. In this method words are represented as vectors in a continuous space, and this representation is used to construct the feature vectors of the documents. We evaluate these fundamentally different approaches on the task of classifying Polish-language Wikipedia articles into 34 subject areas. Our study suggests that the word-embedding-based features outperform the standard NLP-based features, provided a sufficiently large training dataset is available.
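The two feature-construction schemes contrasted above can be sketched as follows. This is a toy illustration with a hypothetical vocabulary and a stand-in embedding table, not the paper's actual pipeline: the BoW variant here skips the NLP preprocessing (lemmatization, inflection handling) the paper applies, and the embedding variant pools word vectors by simple averaging, one common way to turn word2vec-style vectors into document features.

```python
import numpy as np

def bow_features(docs, vocab):
    """Bag-of-Words: each document becomes a vector of raw term
    frequencies over a fixed vocabulary (the paper derives the term
    list via NLP preprocessing; here it is given directly)."""
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for d, doc in enumerate(docs):
        for tok in doc.lower().split():
            if tok in index:
                X[d, index[tok]] += 1
    return X

def embedding_features(docs, emb):
    """Word-embedding features: a document vector is the average of
    the embedding vectors of its in-vocabulary words."""
    dim = len(next(iter(emb.values())))
    X = np.zeros((len(docs), dim))
    for d, doc in enumerate(docs):
        vecs = [emb[t] for t in doc.lower().split() if t in emb]
        if vecs:
            X[d] = np.mean(vecs, axis=0)
    return X
```

Either feature matrix can then be fed to an off-the-shelf classifier; the paper's comparison is precisely between these two kinds of document representations.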
