Title: Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection
Authors: Jan Philip Wahle, Terry Ruas, Norman Meuschke, and Bela Gipp
Contact email: wahle@uni-wuppertal.de; ruas@uni-wuppertal.de
Venue: JCDL
Year: 2021
================================================================
Dataset Description:

Training:
1,474,230 aligned paragraphs (98,282 original; 1,375,948 paraphrased with 3 models and 5 hyperparameter configurations, 98,282 each) extracted from 4,012 English Wikipedia articles.

Testing:
BERT-large (cased):
    arXiv - Original - 20,966; Paraphrased - 20,966
    Theses - Original - 5,226; Paraphrased - 5,226
    Wikipedia - Original - 39,241; Paraphrased - 39,241
RoBERTa-large (cased):
    arXiv - Original - 20,966; Paraphrased - 20,966
    Theses - Original - 5,226; Paraphrased - 5,226
    Wikipedia - Original - 39,241; Paraphrased - 39,241
Longformer-large (uncased):
    arXiv - Original - 20,966; Paraphrased - 20,966
    Theses - Original - 5,226; Paraphrased - 5,226
    Wikipedia - Original - 39,241; Paraphrased - 39,241
================================================================
Dataset Structure:

[og] folder: the original documents, split by data source into the following folders:
    [arxiv]
    [thesis]
    [wikipedia]
    [wikipedia_train]

[`model_name`_mlm_prob_`probability`] folders (e.g., bert-large-cased_mlm_prob_0.15): each contains all examples paraphrased by the model named `model_name` with Masked Language Modeling probability `probability`. Each model/probability folder holds the corresponding paraphrased documents in the same layout as [og], plus the hyperparameter file:
    [arxiv]
    [thesis]
    [wikipedia]
    [wikipedia_train]
    hparams.yml

hparams.yml contains the hyperparameters needed to reconstruct the dataset using the official repository.
================================================================
Files:

At the lowest folder level, each `.txt` file contains exactly one paragraph. The filename contains either "ORIG" for an original paragraph or "SPUN" for a paraphrased one.
================================================================
Code:

To avoid misuse of the code for constructing machine-paraphrased plagiarism, you must consent to our Terms and Conditions (see TermsAndConditions.pdf) and send the signed version via mail to one of the contact addresses above to obtain access to our repository.
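The folder and filename conventions described above can be turned into labels with a short script. The following is a minimal sketch, not the authors' loader: it assumes the data-source folders sit directly under [og] and under each model/probability folder, and that the probability in the folder name is a plain decimal such as 0.15. The names `Example`, `parse_top_folder`, `label_from_filename`, and `iter_examples` are hypothetical helpers introduced only for illustration.

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple, Optional, Tuple

# Matches paraphrase folders such as "bert-large-cased_mlm_prob_0.15".
# Assumption: the probability is written as a plain decimal number.
PARAPHRASE_DIR = re.compile(r"^(?P<model>.+)_mlm_prob_(?P<prob>\d*\.?\d+)$")


class Example(NamedTuple):
    path: Path
    source: str                # e.g. "arxiv", "thesis", "wikipedia", "wikipedia_train"
    model: Optional[str]       # None for files under "og"
    mlm_prob: Optional[float]  # None for files under "og"
    label: int                 # 0 = original ("ORIG"), 1 = paraphrased ("SPUN")


def parse_top_folder(name: str) -> Tuple[Optional[str], Optional[float]]:
    """Return (model_name, mlm_probability) for a top-level folder name."""
    if name == "og":
        return None, None
    match = PARAPHRASE_DIR.match(name)
    if match is None:
        raise ValueError(f"unrecognised folder name: {name!r}")
    return match.group("model"), float(match.group("prob"))


def label_from_filename(filename: str) -> int:
    """Map the ORIG/SPUN marker in a filename to a binary label."""
    if "ORIG" in filename:
        return 0
    if "SPUN" in filename:
        return 1
    raise ValueError(f"no ORIG/SPUN marker in {filename!r}")


def iter_examples(root: Path) -> Iterator[Example]:
    """Walk the dataset root and yield one Example per .txt paragraph file."""
    for top in sorted(p for p in root.iterdir() if p.is_dir()):
        try:
            model, prob = parse_top_folder(top.name)
        except ValueError:
            continue  # skip folders that do not follow the naming scheme
        for txt in sorted(top.rglob("*.txt")):
            parts = txt.relative_to(top).parts
            if len(parts) < 2:
                continue  # assumed layout: <top>/<source>/.../<file>.txt
            yield Example(txt, parts[0], model, prob, label_from_filename(txt.name))
```

Calling `list(iter_examples(Path("path/to/dataset")))` (the path is a placeholder) would give one record per paragraph file, which can then be grouped by `source`, `model`, or `mlm_prob` to reproduce the splits listed in the dataset description.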
Science Policy, machine-paraphrased plagiarism, Information Systems not elsewhere classified, Biophysics, Plant Biology, Marine Biology, paraphrased examples, Masked Language Modeling probability, Norman Meuschke, TBA, folder level, Wikipedia articles, Sociology, Neural Paraphrase Detection Authors, paraphrased documents, Genetics, contact addresses, official repository, 3 models, Ecology, data source, 5 hyperparameter configurations, Jan Philip Wahle, Neural Language Models, Terry Ruas, Biological Sciences not elsewhere classified
| Indicator | Description | Value |
|---|---|---|
| selected citations | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 |
| popularity | This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| influence | This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
| views | | 49 |
| downloads | | 28 |

Views provided by UsageCounts
Downloads provided by UsageCounts