

Splitting Arabic Texts into Elementary Discourse Units

doi: 10.1145/2601401
In this article, we present the first study of the feasibility of segmenting Arabic discourse into elementary discourse units (EDUs) within the Segmented Discourse Representation Theory (SDRT) framework. We first describe our annotation scheme, which defines a set of principles to guide the segmentation process. Two corpora have been annotated according to this scheme: elementary school textbooks and newspaper documents extracted from the syntactically annotated Arabic Treebank. We then propose a multiclass supervised learning approach that predicts nested units. Our approach combines punctuation, morphological, lexical, and shallow syntactic features, and we investigate how each feature type contributes to the learning process. We show that an extensive morphological analysis is crucial to achieving good results on both corpora. In addition, we show that adding chunks does not improve the performance of our system.
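As a rough illustration of how predicting nested units can be framed as multiclass classification over tokens, the sketch below converts a bracketed segmentation into one label per token. The label set (`BEGIN`, `END`, `BEGIN_END`, `INSIDE`), the function name, and the bracket input format are assumptions made for this example; they are not taken from the article and may differ from the classes its system actually predicts.

```python
from typing import List, Tuple

# Hypothetical label set, chosen for illustration only:
#   BEGIN      token opens at least one unit
#   END        token closes at least one unit
#   BEGIN_END  token both opens and closes a unit
#   INSIDE     token is not at a unit boundary


def bracket_to_labels(annotated: List[str]) -> List[Tuple[str, str]]:
    """Turn a bracketed token sequence into (token, label) pairs.

    "[" and "]" mark the start and end of a (possibly nested) unit.
    Several brackets on the same token collapse into a single label,
    which is a simplification of a full nested annotation.
    """
    pairs: List[Tuple[str, str]] = []
    pending_open = False  # True if a "[" was seen since the last token
    for item in annotated:
        if item == "[":
            pending_open = True
        elif item == "]":
            if not pairs:
                raise ValueError("closing bracket before any token")
            token, label = pairs[-1]
            if label == "BEGIN":
                label = "BEGIN_END"
            elif label == "INSIDE":
                label = "END"
            pairs[-1] = (token, label)
        else:
            pairs.append((item, "BEGIN" if pending_open else "INSIDE"))
            pending_open = False
    return pairs


# Placeholder tokens A-D; the inner unit [B C] is nested in the outer one.
example = ["[", "A", "[", "B", "C", "]", "D", "]"]
for token, label in bracket_to_labels(example):
    print(token, label)
```

With labels of this kind, any standard multiclass classifier could be trained on per-token feature vectors, for instance punctuation, morphological, and lexical cues of the sort the abstract lists.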
- Université Paris Diderot, France
- National Polytechnic Institute of Toulouse, France
- University of Toulouse, France
- Association for Computing Machinery, United States
ACM Computing Classification System: ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ComputingMethodologies_PATTERNRECOGNITION
Microsoft Academic Graph classification: Treebank, Punctuation, Artificial intelligence, Feature (machine learning), Set (abstract data type), Scheme (programming language), Computer science, Segmentation, Discourse representation theory, Natural language processing, Supervised learning
Artificial intelligence, Machine learning, Logic in computer science, Computation and language, Discourse Segmentation, Elementary Discourse Units, Arabic Language, General Computer Science, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-LO]Computer Science [cs]/Logic in Computer Science [cs.LO], [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]


