ZENODO
Dataset · 2021
License: CC BY
Data sources: ZENODO; Datacite

Identifying Machine-Paraphrased Plagiarism

Authors: Wahle, Jan Philip; Ruas, Terry; Foltynek, Tomas; Meuschke, Norman; Gipp, Bela

Abstract

README.txt

Title: Identifying Machine-Paraphrased Plagiarism
Authors: Jan Philip Wahle, Terry Ruas, Tomas Foltynek, Norman Meuschke, and Bela Gipp
Contact email: wahle@gipplab.org; ruas@gipplab.org
Venue: iConference
Year: 2022

================================================================
Dataset Description:

Training: 200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).

Testing:
  SpinBot:
    arXiv     - Original: 20,966; Spun: 20,867
    Theses    - Original: 5,226;  Spun: 3,463
    Wikipedia - Original: 39,241; Spun: 40,729
  SpinnerChief-4W:
    arXiv     - Original: 20,966; Spun: 21,671
    Theses    - Original: 2,379;  Spun: 2,941
    Wikipedia - Original: 39,241; Spun: 39,618
  SpinnerChief-2W:
    arXiv     - Original: 20,966; Spun: 21,719
    Theses    - Original: 2,379;  Spun: 2,941
    Wikipedia - Original: 39,241; Spun: 39,697

================================================================
Dataset Structure:

[human_evaluation] folder: human evaluation to identify human-generated and machine-paraphrased text. It contains the files (original and spun) as well as the answer key for the survey performed with human subjects (all data is anonymized for privacy reasons).
  NNNNN.txt - whole document from which an extract was taken for human evaluation
  key.txt.zip - information about each case (ORIG/SPUN)
  results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
  results-corrected.xlsx - at the very beginning there was a mistake in one question (wrong extract); these results were excluded

[automated_evaluation] folder: contains all files used for the automated evaluation considering [spinbot] (https://spinbot.com/API) and [spinnerchief] (http://developer.spinnerchief.com/API_Document.aspx). Each paraphrase-tool folder contains [corpus] and [vectors] sub-folders. For [spinnerchief], two variations are included: a 4-word-changing ratio (default) and a 2-word-changing ratio.

[vectors] sub-folder: contains the average of all word vectors for each paragraph. Each line has as many comma-separated values as the word embedding technique has dimensions (see the paper for details), followed by its class label (mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv); the extension is .arff, but the files can be read as normal .txt files. The word embedding technique is encoded in the file name with the structure <technique>-<type>-mean-<data>.arff, where
  <technique>:
    d2v        - doc2vec
    google     - word2vec
    fasttextnw - fastText without subwording
    fasttextsw - fastText with subwording
    glove      - GloVe
  <type>:
    arxivp  - arXiv paragraph split
    thesisp - Theses paragraph split
    wikip   - Wikipedia paragraph split
  (wikipedia_paragraph_vector_train contains the vectors used for training; it follows the same wikip structure)
Details for each technique used can be found in the paper referenced at the start of this README file.
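Loading a [vectors] file only requires splitting each line on commas. The snippet below is a minimal sketch, assuming the line layout described above (embedding values followed by an mg/og label) and skipping any ARFF header or comment lines if present; the file names in the usage comment are hypothetical examples of the <technique>-<type>-mean-<data>.arff pattern, not guaranteed to exist verbatim in the archive.

```python
# Minimal loading sketch (illustration only): parse one [vectors] file into a
# feature matrix and a label list. Each data line is a comma-separated embedding
# average ending with the class label ("mg" or "og").
import numpy as np

def parse_vector_file(path):
    features, labels = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks and possible ARFF header/comment lines.
            if not line or line.startswith(("@", "%")):
                continue
            *values, label = line.split(",")
            features.append([float(v) for v in values])
            labels.append(label.strip())
    return np.array(features), labels

# Hypothetical file names following <technique>-<type>-mean-<data>.arff:
# X_og, y_og = parse_vector_file("glove-wikip-mean-og.arff")
# X_mg, y_mg = parse_vector_file("glove-wikip-mean-mg.arff")
```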
[corpus] sub-folder: contains the raw text (no pre-processing) used for training and testing at the paragraph level. The spun paragraphs used for training are generated only with the SpinBot tool; for testing, both SpinBot and SpinnerChief are used. The paragraph split is generated by selecting paragraphs with 3 or more sentences from the original documents. Each folder is divided into mg (machine-generated through SpinBot and SpinnerChief) and og (original) files. The document split is not available since our experiments only use the paragraph level.

Machine learning models: SVM, Naive Bayes, and Logistic Regression. The grid search for hyperparameter adjustments of the machine learning classifiers is described in the paper; a minimal training sketch is shown after this README.

@incollection{WahleRFM22,
  title     = {Identifying {{Machine-Paraphrased Plagiarism}}},
  booktitle = {Information for a {{Better World}}: {{Shaping}} the {{Global Future}}},
  author    = {Wahle, Jan Philip and Ruas, Terry and Folt{\'y}nek, Tom{\'a}{\v s} and Meuschke, Norman and Gipp, Bela},
  editor    = {Smits, Malte},
  year      = {2022},
  volume    = {13192},
  pages     = {393--413},
  publisher = {{Springer International Publishing}},
  address   = {{Cham}},
  doi       = {10.1007/978-3-030-96957-8_34},
  isbn      = {978-3-030-96956-1 978-3-030-96957-8},
}

For our previous publication, which uses only SpinBot and Wikipedia articles for the document and paragraph splits, please see the following publication; the dataset used there is hosted in DeepBlue.
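The training sketch referenced above follows here. It is illustrative only and assumes scikit-learn (the README does not prescribe a library): it grid-searches an SVM over a placeholder parameter grid on the pre-computed paragraph vectors; the actual grids, and the Naive Bayes and Logistic Regression counterparts, are described in the paper.

```python
# Illustrative sketch only: fit one of the classifier families named in the README
# (SVM here; sklearn.naive_bayes.GaussianNB or sklearn.linear_model.LogisticRegression
# can be swapped in) on the averaged paragraph vectors. The parameter grid below is a
# placeholder, not the grid used in the paper.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_og/X_mg, y_og/y_mg would come from parse_vector_file() in the sketch above
# (a hypothetical helper), one call per class file of a chosen embedding/test split.
# X = np.vstack([X_og, X_mg]); y = np.array(y_og + y_mg)

param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```

Training on the wikipedia_paragraph_vector_train vectors and evaluating on the SpinBot and SpinnerChief test splits mirrors the setup described in the README.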

Keywords

paraphrase detection, word embeddings, document classification, plagiarism detection

  • BIP! impact indicators (based on the underlying citation network): citations: 0; popularity ("current" attention): Average; influence (overall/total impact, diachronically): Average; impulse (initial momentum after publication): Average
  • OpenAIRE UsageCounts: 2K views, 493 downloads