Publication . Conference object . 2021

CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl

Maik Fröbe; Janek Bevendorff; Lukas Gienapp; Michael Völske; Benno Stein; Martin Potthast; Matthias Hagen
Published: 11 Jul 2021
Publisher: ACM

The number of near-duplicates in web crawls like the ClueWeb or the Common Crawl forces their users either to develop a preprocessing pipeline for deduplication, which is costly both computationally and in person hours, or to accept the undesired effects that near-duplicates have on the reliability and validity of experiments. We introduce ChatNoir-CopyCat-21, which simplifies deduplication significantly. It comes in two parts: (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, as well as between selections of these crawls, and (2) a software library that implements the deduplication of arbitrary document sets. Our analysis shows that 14–52% of the documents within a crawl and around 0.7–2.5% between the crawls are near-duplicates. Two showcases demonstrate the application and usefulness of our resource.

Subjects by Vocabulary

Microsoft Academic Graph classification: Copycat, Preprocessor, Data deduplication, Resource (project management), Information retrieval, Pipeline (software), Computer science, Reliability (computer networking), Software, business.industry, business

ACM Computing Classification System: Data_FILES InformationSystems_MISCELLANEOUS

Providers: Crossref