publication . Conference object . Other literature type . 2017

Connecting Resources: Which Issues Have to be Solved to Integrate CMC Corpora from Heterogeneous Sources and for Different Languages?

Beißwenger, Michael; Wigham, Ciara R.; Etienne, Carole; Fišer, Darja; Grumt Suárez, Holger; Herzberg, Laura; Hinrichs, Erhard; Horsmann, Tobias; Karlova-Bourbonus, Natali; Lemnitzer, Lothar; ...
Open Access English
  • Published: 03 Oct 2017
  • Publisher: HAL CCSD
  • Country: France
Abstract
The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC c...
Subjects
ACM Computing Classification System: ComputingMilieux_COMPUTERSANDSOCIETY
free text keywords: corpora, research infrastructures, annotation, anonymization, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, anonymisation
Zenodo
Other literature type . 2017
Provider: Datacite
HAL-ENS-LYON
Conference object . 2017
24 references, page 1 of 2

Barbaresi, A. (2016). Efficient construction of metadata-enhanced web corpora. In Proceedings of the 10th Web as Corpus Workshop, Association for Computational Linguistics, pp. 7-16. https://hal.archives-ouvertes.fr/hal-01371704v2/document.

Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L., Storrer, A. (2012). A TEI Schema for the Representation of Computer-mediated Communication. Journal of the Text Encoding Initiative 3. http://jtei.revues.org/476 (DOI: 10.4000/jtei.476). [OpenAIRE]

Beißwenger, M., Lüngen, H., Schallaböck, J., Weitzmann, J.H., Herold, A., Kamocki, P., Storrer, A., Wildgans, J. (2017, in press). Rechtliche Bedingungen für die Bereitstellung eines Chat-Korpus in CLARIN-D: Ergebnisse eines Rechtsgutachtens. In: M. Beißwenger (Ed.), Empirische Erforschung internetbasierter Kommunikation. Berlin/New York: de Gruyter (Empirische Linguistik / Empirical Linguistics). [OpenAIRE]

Beißwenger, M., Chanier, T., Erjavec, T., Fišer, D., Herold, A., Ljubešić, N., Lüngen, H., Poudat, C., Stemle, E., Storrer, A., Wigham, C. (2017a). Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries. In: L. Borin (Ed.), Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 26-28 October 2016 (Linköping University Electronic Conference Proceedings 136), pp. 1-18. http://www.ep.liu.se/ecp/contents.asp?issue=136

Chanier, T., Jin, K. (2013). Defining the online interaction space and the TEI structure for CoMeRe corpora. Projet CoMeRe (Communication Médiée par les Réseaux). https://corpuscomere.files.wordpress.com/2014/01/tei-cmc-comere-interactionspace_131231.pdf

Chanier, T., Poudat, C., Sagot, B., Antoniadis, G., Wigham, C., Hriba, L., Longhi, J., Seddah, D. (2014). The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. Journal for Language Technology and Computational Linguistics, 29(2), pp. 1-30. http://www.jlcl.org/2014_Heft2/1Chanier-et-al.pdf. [OpenAIRE]

Fišer, D., Erjavec, T., Ljubešić, N. (2017). The compilation, processing and analysis of the Janes corpus of Slovene user-generated content. In C.R. Wigham, G. Ledegen (Eds.), Corpus de Communication Médiée par les Réseaux. Construction, structuration, analyse. Paris: L'Harmattan (Humanités numériques), pp. 125-138.

Fišer, D., Erjavec, T., Ljubešić, N. (2016). JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2.0, 4(2), pp. 67-99.

Frey, J.C., Glaznieks, A., Stemle, E.W. (2015). The DiDi Corpus of South Tyrolean CMC Data. In Proceedings of the 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media (NLP4CMC2015), Essen, Germany.

Frey, J.-C., Glaznieks, A., Stemle, E. (2016). The DiDi Corpus of South Tyrolean CMC Data: A Multilingual Corpus of Facebook Texts. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016). ceur-ws.org/Vol-1749/paper27.pdf.

Geyken, A., Barbaresi, A., Didakowski, J., Jurish, B., Wiegand, F., Lemnitzer, L. (2017, in press). Die Korpusplattform des „Digitalen Wörterbuchs der deutschen Sprache“ (DWDS). Zeitschrift für germanistische Linguistik, 45 (2). [OpenAIRE]

Grumt Suárez, H., Karlova-Bourbonus, N., Lobin, H. (2016). Compilation and Annotation of the Discourse-structured Blog Corpus for German. In Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities (cmc-corpora2016), Ljubljana, Slovenia. http://nl.ijs.si/janes/wp-content/uploads/2016/09/CMC-2016_Grumt_et_al_Compilation-and-Annotation.pdf.

Ho-Dac, L.-M., Laippala, V. (2017). Le corpus WikiDisc, une ressource pour la caractérisation des discussions en ligne. In C.R. Wigham, G. Ledegen (Eds.), Corpus de Communication Médiée par les Réseaux. Construction, structuration, analyse. Paris: L'Harmattan (Humanités numériques), pp. 107-124. [OpenAIRE]

Longhi, J., Wigham, C.R. (2015). Structuring a CMC corpus of political tweets in TEI: corpus features, ethics and workflow. Poster at Corpus Linguistics 2015, Lancaster, United Kingdom. https://halshs.archives-ouvertes.fr/halshs-01176061. [OpenAIRE]

Lüngen, H. (2017). DEREKO - Das Deutsche Referenzkorpus. Schriftkorpora der deutschen Gegenwartssprache am Institut für Deutsche Sprache in Mannheim. Zeitschrift für germanistische Linguistik, 45 (1), pp. 161-170. [OpenAIRE]
