ZENODO
Dataset . 2024
License: CC BY NC SA
Data sources: ZENODO
ZENODO
Dataset . 2020
License: CC BY NC SA
Data sources: Datacite
ZENODO
Dataset . 2020
License: CC BY NC SA
Data sources: ZENODO
https://doi.org/10.5281/zenodo...
Dataset . 2020
License: CC BY NC SA
Data sources: Sygma
ZENODO
Dataset . 2024
License: CC BY NC SA
Data sources: Datacite

KITAB Text Reuse Data

Authors: Smith, David; Savant, Sarah Bowen; Romanov, Maxim; Muther, Ryan; Seydi, Masoumeh; Merchant, Sohail


Abstract

KITAB is funded by the European Research Council under the European Union’s Horizon 2020 research and innovation programme, awarded to the KITAB project (Grant Agreement No. 772989, PI Sarah Bowen Savant), hosted at Aga Khan University, London. In addition, it has received funding from the Qatar National Library to aid in the adaptation of the passim algorithm for Arabic.

KITAB’s text reuse data is generated by running passim on the OpenITI corpus (DOI: 10.5281/zenodo.3082463). Each version is the output of a separate run, and the version number corresponds to the corpus release. To prepare the corpus for a passim run, we normalize the texts, remove most of the non-Arabic characters, and then chunk the texts into passages 300 words in length (using the non-Arabic characters, including white space, to delimit words). The chunks, called milestones, are identified by unique ids. This dataset represents the reuse cases that have been identified among milestones.

The text reuse dataset consists of one folder per book. Each folder includes CSV files of the text reuse cases (alignments) between the corresponding book and every other book in which passim has found instances of reuse. The files are named after the ids of the two books, joined by an underscore and followed by .csv (e.g., ‘Shamela000001.mARkdown_Shamela000002.csv’). The CSV files are not the immediate output of passim but the result of a post-processing step.

The folder structure is as below (for a total of four books, for example).

bookVersionID1
  |- bookVersionID1_bookVersionID4.csv
  |- bookVersionID1_bookVersionID3.csv
bookVersionID4
  |- bookVersionID4_bookVersionID3.csv

Where a book has no CSV file in any of the folders, the passim algorithm has not found any text reuse cases for that book. In the above example, there is no folder or CSV file for bookVersionID2, which means that no reuse cases were detected between book 2 and any of the other three books.
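The naming convention and folder layout described above can be read back programmatically. The following is a minimal sketch, not part of the KITAB tooling: the function names `parse_pair_filename` and `list_pairs` are hypothetical, and the code assumes that book ids never contain an underscore, so that the single underscore in a file name separates the two ids.

```python
from pathlib import Path
from typing import List, Tuple


def parse_pair_filename(name: str) -> Tuple[str, str]:
    """Split '<id1>_<id2>.csv' into the two book ids.

    Assumption (not stated in the dataset description): book ids
    contain no underscore, so the first underscore is the separator.
    """
    if not name.endswith(".csv"):
        raise ValueError(f"not a CSV file name: {name!r}")
    stem = name[: -len(".csv")]
    first, sep, second = stem.partition("_")
    if not sep or not first or not second:
        raise ValueError(f"unexpected file name: {name!r}")
    return first, second


def list_pairs(root: Path) -> List[Tuple[str, str]]:
    """Return the book pairs found one level below `root`.

    Each book folder is expected to hold the CSV files of the
    alignments between that book and other books.
    """
    return [parse_pair_filename(p.name) for p in sorted(root.glob("*/*.csv"))]
```

Calling `list_pairs(Path("path/to/release"))` on an unpacked release (the path is a placeholder) would give one tuple per alignment file. A book that never appears in any tuple has no detected reuse.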
To save computational resources, we generate text reuse data uni-directionally, which means a pair of documents is compared only once (document1 to document2, not document2 to document1). The alignments in the CSV files are lists of records. Each record shows a pair of matched passages between two books together with statistics, such as the algorithm score, and contextual information, such as the start and end positions of the aligned passages, so that one can find those passages in the books. A description of the alignment fields is given in the release notes.

For each dataset, we also generate statistical data on the alignments between the book pairs. The data is published in an application that facilitates search, filtering, and visualization. The link to the corresponding application is given in the release notes.

Note on Release Numbering: In Version 2020.1.1, 2020 is the year of the release, the first dotted number (.1) is the ordinal release number in 2020, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second one keeps increasing.

Note: The very first release of the KITAB text reuse data (2019.1.1) is published here as it was too big to publish on Zenodo. To receive more information on the complete datasets, please contact us via kitab-project@outlook.com (or other team members). Future releases may include only part of the generated data if the whole dataset is too big to publish on Zenodo. However, the data is open access for anyone to use. We provide detailed information on the datasets in the corresponding release notes.
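Because each pair of books is compared only once, the alignments for two given books may be stored under either ordering of their ids. The sketch below shows one way to look up the file for a pair. The helper `find_alignment_file` is hypothetical, and it assumes that each folder is named after the first id in the file names it contains, as in the four-book example in the abstract.

```python
from pathlib import Path
from typing import Optional


def find_alignment_file(root: Path, book_a: str, book_b: str) -> Optional[Path]:
    """Locate the CSV for a book pair, trying both orderings.

    Assumed layout (taken from the four-book example, not verified
    against a release): <root>/<id1>/<id1>_<id2>.csv
    Returns None when neither ordering exists, which the dataset
    description interprets as "no reuse detected" for the pair.
    """
    for first, second in ((book_a, book_b), (book_b, book_a)):
        candidate = root / first / f"{first}_{second}.csv"
        if candidate.is_file():
            return candidate
    return None
```

The column names inside each CSV are documented in the release notes, so the sketch stops at locating the file and does not parse the records.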
