Specialised POS Tagged Syriac Corpus for State Morphology

Overview

This dataset consists of twelve .TXT files, each representing a Syriac text that has been transcribed and tagged for part of speech (POS). The corpus forms part of a PhD research project on the historical syntax of Aramaic (Syriac) at The Australian National University (2020 to present) in Canberra, Australia. Among other topics, the project investigates noun state morphology, which is reflected in the POS scheme for this corpus.

Method

A detailed summary of the methodology is provided in El-Khaissi (data paper in review with the Journal of Open Humanities Data). Transcriptions are sourced from the Digital Syriac Corpus. POS tags are based on word matches using the SEDRA IV API (v1.0.0). Texts were selected to minimise external influence on Syriac grammar and to maximise coverage of the key periods of the Syriac language, from the 2nd to the 13th century AD.

POS Format & Abbreviations

POS tags in the text files follow the format word_TAG-TAG: an underscore '_' marks the beginning of a tag sequence, and tag values are separated by hyphens '-'. For example (noting text directionality constraints):

ܒܘܪܟܬܐ_EMP-N

The following list defines all POS tags, which are based on the parameters available in the SEDRA IV API (v1.0.0).

Tag        Definition
ABS        Absolute state noun (indeterminate relic)
EMP        Emphatic state noun (new indeterminate)
CNS        Construct state noun (bound noun)
X          State not applicable
PTCL       particle
PRO        pronoun
PREP       preposition
V          verb
DEN        denominative
N          noun
NUM        numeral
SBV        substantive
ADJ        adjective
PN         proper noun
ADV        adverb
DNM        demonym
PTCPADJ    participle adjective
IDM        idiom
DUP        See Quality Control & Limitations below

Quality Control & Limitations

On average per manuscript, the POS-tagging process achieved a saturation of 63.13%. Tagging relied on exact matching, which does not take syntactic or semantic context into account. Syriac words that exhibit homonymy are therefore tagged with the value 'DUP' and should be assessed manually in their original context. Of the 297,981 words in the corpus with an available POS tag, approximately 73,188 (24.56%) reflected some kind of homonymy, that is, a word with several possible semantic and/or syntactic interpretations.

Because this dataset was created for a research project investigating noun state morphology, additional tags were created for the various state values. Grammatical features such as number and gender were not required for this investigation and were therefore excluded from the POS-tagging process.

Contact

For any questions, please contact Charbel El-Khaissi.
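
The token format described under POS Format & Abbreviations can be read programmatically. The following is a minimal sketch, not part of the dataset itself. It assumes that tokens in the .TXT files are separated by whitespace, that an untagged word simply carries no underscore, and that 'DUP' appears as an ordinary tag value. The function names parse_token and tag_counts are hypothetical helpers introduced here for illustration.

```python
from collections import Counter


def parse_token(token):
    """Split a 'word_TAG-TAG' token into (word, [tags]).

    Assumption: the last underscore starts the tag sequence, and a token
    without any underscore is an untagged word.
    """
    word, sep, tag_seq = token.rpartition("_")
    if not sep:
        # No underscore found: the token carries no tag sequence.
        return token, []
    # Tag values are separated by one or more hyphens; drop empty pieces.
    return word, [tag for tag in tag_seq.split("-") if tag]


def tag_counts(text):
    """Count every tag value across whitespace-separated tokens."""
    counts = Counter()
    for token in text.split():
        _, tags = parse_token(token)
        counts.update(tags)
    return counts
```

For instance, parse_token("ܒܘܪܟܬܐ_EMP-N") returns the Syriac word together with the tag list ['EMP', 'N']. Calling tag_counts on the contents of a file gives a per-tag tally, from which the share of 'DUP' tokens that need manual review can be estimated.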

Related Organizations

Australian National University
Australia

Keywords

Syriac, POS tagging, historical linguistics, manuscripts
