publication . Conference object . 2017

Using Word Embedding for Cross-Language Plagiarism Detection

Ferrero, Jérémy; Agnès, Frédéric; Besacier, Laurent; Schwab, Didier
Open Access English
  • Published: 03 Apr 2017
  • Publisher: HAL CCSD
Abstract
This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.
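As a rough illustration of the idea in the abstract, the sketch below compares an English and a French sentence by summing word vectors from a shared embedding space and taking the cosine of the two sums, then shows how an F1 score is computed from detection counts. It is a minimal sketch under stated assumptions, not the paper's actual methods: the toy 3-dimensional vectors, the `EMBEDDINGS` table, and the function names `sentence_vector`, `cosine`, `similarity`, and `f1_score` are all hypothetical and chosen only for this example.

```python
import math

# Hypothetical toy vectors standing in for a shared English-French
# embedding space; a real system would load pretrained bilingual embeddings.
EMBEDDINGS = {
    "cat": [0.90, 0.10, 0.00],
    "chat": [0.88, 0.12, 0.02],
    "sleeps": [0.10, 0.90, 0.10],
    "dort": [0.12, 0.85, 0.15],
    "car": [0.00, 0.10, 0.95],
}


def sentence_vector(tokens, embeddings):
    """Sum the vectors of known tokens; unknown tokens are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token.lower())
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(en_tokens, fr_tokens, embeddings=EMBEDDINGS):
    """Cross-language similarity of two tokenized sentences."""
    return cosine(
        sentence_vector(en_tokens, embeddings),
        sentence_vector(fr_tokens, embeddings),
    )


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(similarity(["cat", "sleeps"], ["chat", "dort"]))  # translation pair
    print(similarity(["cat", "sleeps"], ["car"]))           # unrelated pair
    print(f1_score(tp=8, fp=2, fn=2))
```

With these toy vectors the translation pair scores close to 1 and the unrelated pair scores much lower; a detector would flag pairs whose similarity exceeds a threshold tuned on held-out data.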
Subjects
free text keywords: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]


