This dataset accompanies a research paper that introduces a system designed to support the Wikipedia community in combating vandalism on the platform. The dataset was prepared to improve the accuracy and efficiency of Wikipedia patrolling in multiple languages. Its release is intended to encourage further research and development in vandalism detection techniques, fostering a safer and more inclusive environment for the Wikipedia community. Researchers and practitioners can use it to train and validate vandalism detection models and to improve content moderation strategies on online platforms.

Dataset Details:
Number of Languages: 47
Observation period: 6 months training, one week hold-out testing
Use Case: The dataset is primarily intended for training and evaluating vandalism detection systems.
Features: Each record characterizes the corresponding revision of a Wikipedia page, including revision metadata, user details, the text inserted, removed, or changed, and corresponding features based on masked language models (MLMs).
Data Filtering and Feature Engineering: Filtering and feature engineering techniques, including multilingual masked language modeling, were applied to ensure the dataset's quality and relevance for training the vandalism detection system.
Files: Training and hold-out testing datasets, each provided for anonymous users and for all users.

Related paper citation:

@inproceedings{10.1145/3580305.3599823,
  author = {Trokhymovych, Mykola and Aslam, Muniza and Chou, Ai-Jou and Baeza-Yates, Ricardo and Saez-Trumper, Diego},
  title = {Fair Multilingual Vandalism Detection System for Wikipedia},
  year = {2023},
  isbn = {9798400701030},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3580305.3599823},
  doi = {10.1145/3580305.3599823},
  abstract = {This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.},
  booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {4981–4990},
  numpages = {10},
  location = {Long Beach, CA, USA},
  series = {KDD '23}
}
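The time-based split described under Dataset Details (a training period followed by a one-week hold-out) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the field names `rev_id`, `timestamp`, and `is_anonymous`, the dates, and the helper `temporal_split` are hypothetical and do not reflect the actual file schema or the authors' pipeline.

```python
from datetime import datetime, timedelta


def temporal_split(revisions, holdout_start, holdout_days=7):
    """Split revision records by timestamp (hypothetical helper).

    Records before `holdout_start` go to training; records within the
    following `holdout_days` days go to the hold-out set. Anything later
    is dropped.
    """
    holdout_end = holdout_start + timedelta(days=holdout_days)
    train = [r for r in revisions if r["timestamp"] < holdout_start]
    holdout = [
        r for r in revisions
        if holdout_start <= r["timestamp"] < holdout_end
    ]
    return train, holdout


# Synthetic records; the field names and dates are illustrative assumptions.
revisions = [
    {"rev_id": 1, "timestamp": datetime(2023, 1, 15), "is_anonymous": True},
    {"rev_id": 2, "timestamp": datetime(2023, 3, 2), "is_anonymous": False},
    {"rev_id": 3, "timestamp": datetime(2023, 7, 2), "is_anonymous": True},
    {"rev_id": 4, "timestamp": datetime(2023, 7, 9), "is_anonymous": False},
]

train, holdout = temporal_split(revisions, holdout_start=datetime(2023, 7, 1))

# An "anonymous users only" view is then a simple filter over either split.
train_anonymous = [r for r in train if r["is_anonymous"]]
```

Splitting by time rather than at random keeps the evaluation closer to a patrolling setting, where a model trained on past revisions is applied to revisions it has not yet seen.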
This work has been funded by MCIN/AEI/10.13039/501100011033 under the Maria de Maeztu Units of Excellence Programme (CEX2021-001195-M).
algorithm fairness, vandalism, content reliability, Wikipedia, knowledge integrity