The impact of indexing approaches on Arabic text classification

Publication: Article, 10 Jul 2016, English

Publisher: SAGE Publications

Journal: Journal of Information Science, volume 43, pages 159-173 (ISSN: 0165-5515, eISSN: 1741-6485)

Authors: Amer Al-Badarneh; Emad Al-Shawakfa; Basel Bani-Ismail; Khaleel Al-Rabab'ah; Safwan Shatnawi

doi: 10.1177/0165551515625030

Abstract

This paper investigates the impact of different indexing approaches (full-word, stem, and root) on Arabic text classification. A naïve Bayes classifier is used to construct multinomial classification models, which are evaluated using stratified k-fold cross-validation (k ranges from 2 to 10). The study uses a corpus of 1000 normalized Arabic documents. The results of one experiment show significant accuracy improvements in most k-folds when the full-word form is used. Further experiments show that the classifier achieves its highest accuracy with eight folds, that is, a 7/8–1/8 train–test ratio, regardless of the indexing approach used. Overall, the classifier achieves a maximum micro-average accuracy of 99.36% with either the full-word form or the stem form. This indicates that the stem is the better choice for classifying Arabic text: it makes the corpus dataset smaller, which improves both processing time and storage utilization, while still achieving the highest level of accuracy.
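The evaluation setup described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. It assumes documents are already tokenized by one of the indexing approaches, uses add-one (Laplace) smoothing, and runs on a tiny invented English toy corpus rather than the paper's 1000 Arabic documents. All function names (`stratified_folds`, `train_multinomial_nb`, `predict`, `cross_validate`) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign document indices to k folds, keeping class proportions balanced."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    return folds


def train_multinomial_nb(docs, labels):
    """Collect class document counts, per-class token counts, and the vocabulary."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, token_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_tokens = sum(token_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                count = token_counts[label][tok]
                score += math.log((count + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k, seed=0):
    """Micro-averaged accuracy: correct predictions pooled over all k test folds."""
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_multinomial_nb(train_docs, train_labels)
        correct += sum(predict(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


# Hypothetical toy corpus: 10 short "documents" per class (illustration only).
SPORT = ["match", "goal", "team", "league"]
ECONOMY = ["market", "bank", "trade", "price"]
docs, labels = [], []
for i in range(10):
    for label, words in (("sport", SPORT), ("economy", ECONOMY)):
        docs.append([words[(i + j) % 4] for j in range(3)])
        labels.append(label)

# k ranges from 2 to 10; each fold trains on (k-1)/k of the data, e.g. 7/8 for k=8.
for k in range(2, 11):
    acc = cross_validate(docs, labels, k)
    print(f"k={k:2d}  train fraction={(k - 1) / k:.3f}  micro-avg accuracy={acc:.2%}")
```

To compare indexing approaches in this sketch, one would run `cross_validate` three times on the same documents, tokenized as full words, stems, and roots respectively. The stemming and root-extraction steps themselves are outside the scope of this example.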

Related Organizations

Jordan University of Science and Technology, Jordan
University of Bahrain, Bahrain
University of New Brunswick, Canada
Yarmouk University, Jordan
Sultan Qaboos University, Oman

Impact by BIP!

- Selected citations: 18. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.