A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging

Article English OPEN
Sawalha, M ; Atwell, E (2013)
  • Publisher: Edinburgh University Press

The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic grammar in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in the analysis of morphology. A range of existing Arabic Part-of-Speech tag sets is illustrated and compared, and we review generic design criteria for corpus tag sets. For a morphologically rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and its possible values. In our analysis, a tag consists of 22 characters; each position represents a feature, and the letter at that position represents a value or attribute of the morphological feature; the dash ‘-’ marks a feature not relevant to a given word. The first character shows the main Part of Speech: noun, verb, particle, punctuation, or Other (residual); these last two extend the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. Characters 2, 3, and 4 represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (character 2), 3 subclasses of verb (character 3), and 21 subclasses of particle (character 4). Others (residuals) and punctuation marks are represented in characters 5 and 6 respectively.
The next characters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), and declension and conjugation (18). Finally, there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), and types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote the comparison and reuse of Arabic taggers and tagged corpora.
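
The positional layout described in the abstract can be illustrated with a minimal Python sketch. It only splits a 22-character tag into the features named above and treats the dash as "not relevant". The function names and the example tag are hypothetical, and the letters in the example are placeholders, since the abstract does not list the actual value codes of the SALMA Tag Set.

```python
# Feature carried by each of the 22 tag positions, in order (1-based),
# following the list given in the abstract.
SALMA_POSITIONS = [
    "main part of speech",            # 1
    "noun subclass",                  # 2
    "verb subclass",                  # 3
    "particle subclass",              # 4
    "other (residual)",               # 5
    "punctuation mark",               # 6
    "gender",                         # 7
    "number",                         # 8
    "person",                         # 9
    "inflectional morphology",        # 10
    "case or mood",                   # 11
    "case and mood marks",            # 12
    "definiteness",                   # 13
    "voice",                          # 14
    "emphasized and non-emphasized",  # 15
    "transitivity",                   # 16
    "rational",                       # 17
    "declension and conjugation",     # 18
    "unaugmented and augmented",      # 19
    "number of root letters",         # 20
    "verb root",                      # 21
    "noun type by final letters",     # 22
]

NOT_RELEVANT = "-"  # the dash marks a feature that does not apply to the word


def split_tag(tag):
    """Map each feature name to its value character, or None if not relevant."""
    if len(tag) != len(SALMA_POSITIONS):
        raise ValueError(
            f"expected {len(SALMA_POSITIONS)} characters, got {len(tag)}"
        )
    return {
        name: (None if ch == NOT_RELEVANT else ch)
        for name, ch in zip(SALMA_POSITIONS, tag)
    }


def relevant_features(tag):
    """Keep only the features that carry a value for this word."""
    return {name: ch for name, ch in split_tag(tag).items() if ch is not None}


# Placeholder tag: the letters X, Y, A, B, C are made up for illustration
# and are NOT real SALMA value codes.
example_tag = "XY" + "-" * 4 + "ABC" + "-" * 13

if __name__ == "__main__":
    for name, value in relevant_features(example_tag).items():
        print(f"{name}: {value}")
```

Because every feature has a fixed position, a tag can be decoded without any knowledge of the tagger that produced it, which is in keeping with the abstract's point that the tag set is not tied to a specific algorithm.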