publication . Preprint . 2017

Using Word Embedding for Cross-Language Plagiarism Detection

Ferrero, J.; Agnes, F.; Besacier, L.; Schwab, D.
Open Access English
  • Published: 10 Feb 2017
Abstract
This paper proposes using distributed representations of words (word embeddings) for cross-language textual similarity detection. The main contributions are the following: (a) we introduce new cross-language similarity detection methods based on distributed representations of words; (b) we combine the proposed methods to verify their complementarity, obtaining an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.
Subjects
free text keywords: Computer Science - Computation and Language