Publication · Conference object · 2015

Collection, Description, and Visualization of the German Reddit Corpus

Barbaresi, Adrien
Open Access · English
  • Published: 29 Sep 2015
  • Publisher: HAL CCSD
Abstract
Reddit is a major social bookmarking and microblogging platform. An extensive dataset of Reddit comments has recently been made publicly available. I use a two-tiered filter to single out comments in German in order to build a linguistic corpus, which is then tokenized and annotated. This article offers first insights into both the nature and quality of the data at the lexical level. Additionally, a visualization makes it possible to grasp the likely geographical distribution of German users of the platform.
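The abstract mentions a two-tiered language filter, and langid.py (Lui and Baldwin, 2012) appears in the references below. As a minimal sketch only, and assuming (rather than knowing from this record) that the first tier is a cheap German function-word heuristic and the second tier is langid.py restricted to a few candidate languages, such a filter could look roughly like this; the word list, thresholds, and helper names are illustrative, not the paper's actual setup.

```python
# Hypothetical two-tiered filter for German Reddit comments (illustrative sketch).
# Tier 1: cheap heuristic based on common German function words.
# Tier 2: langid.py (Lui & Baldwin, 2012), restricted to a small candidate set.
import langid

# Restricting the candidate languages sharpens langid.py's decisions on short texts.
langid.set_languages(["de", "en"])

# Small, illustrative set of high-frequency German function words (assumption, not the paper's list).
GERMAN_FUNCTION_WORDS = {"der", "die", "das", "und", "ist", "nicht",
                         "ich", "ein", "eine", "mit", "auch", "aber"}

def tier_one(comment: str, min_hits: int = 1) -> bool:
    """Keep a comment only if it contains at least `min_hits` German function words."""
    tokens = comment.lower().split()
    return sum(tok in GERMAN_FUNCTION_WORDS for tok in tokens) >= min_hits

def tier_two(comment: str) -> bool:
    """Keep a comment only if langid.py labels it as German."""
    lang, _score = langid.classify(comment)  # score is an unnormalized log-probability
    return lang == "de"

def is_german(comment: str) -> bool:
    """Two-tiered decision: fast heuristic first, statistical identifier second."""
    return tier_one(comment) and tier_two(comment)

if __name__ == "__main__":
    samples = [
        "Das ist ein Test, und ich finde die Idee nicht schlecht.",
        "This is clearly an English comment about Reddit.",
    ]
    for text in samples:
        print(is_german(text), "->", text)
```

In such an arrangement the cheap heuristic discards most clearly non-German comments before the statistical identifier runs, which is the usual motivation for a two-tiered design when processing a very large comment dump.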
Subjects
free text keywords: Computer-mediated Communication, Web corpus construction, Information Visualization, Language Identification, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-WB] Computer Science [cs]/Web

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356-364.

Adrien Barbaresi and Kay-Michael Würzner. 2014. For a fistful of blogs: Discovery and comparative benchmarking of republishable German content. In KONVENS 2014, NLP4CMC workshop proceedings, pages 2-10. Hildesheim University Press.

Adrien Barbaresi. 2013. Crawling microblogging services to gather language-classified URLs. Workflow and case study. In Proceedings of the 51st Annual Meeting of the ACL, Student Research Workshop, pages 9-15.

Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, and Angelika Storrer. 2013. DeRiK: A German reference corpus of computer-mediated communication. Literary and Linguistic Computing, 28(4):531-537.

Alexander Geyken. 2007. The DWDS corpus: A reference corpus for the German language of the 20th century. In Christiane Fellbaum, editor, Collocations and Idioms: Linguistic, lexicographic, and computational aspects, pages 23-41. Continuum Press.

Yingjie Hu, Krzysztof Janowicz, and Sathya Prasad. 2014. Improving Wikipedia-Based Place Name Disambiguation in Short Texts Using Structured Data from DBpedia. In Proceedings of the 8th Workshop on Geographic Information Retrieval, pages 8-16. ACM.

Bryan Jurish and Kay-Michael Würzner. 2013. Word and Sentence Tokenization with Hidden Markov Models. JLCL, 28(2):61-83.

Bryan Jurish. 2003. A Hybrid Approach to Part-of-Speech Tagging. Final report, Kollokationen im Wörterbuch, Berlin-Brandenburgische Akademie der Wissenschaften.

Jochen L. Leidner and Michael D. Lieberman. 2011. Detecting Geographical References in the Form of Place Names and Associated Spatial Natural Language. SIGSPATIAL Special, 3(2):5-11.

Marco Lui and Timothy Baldwin. 2012. langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju, Republic of Korea.

Marco Lui and Timothy Baldwin. 2014. Accurate Language Identification of Twitter Messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17-25.
