figshare
Dataset . 2018
License: CC 0
Data sources: Datacite
figshare
Dataset . 2018
License: CC 0
Data sources: Datacite
DRYAD
Dataset . 2019
License: CC BY
Data sources: Datacite

ArXiV Archive

Authors: Geiger, R. Stuart

Abstract

Step 0: Query from arxiv.org
arXiv's main permitted means of bulk downloading article metadata is its OAI-PMH API. I used the oai-harvest program to download the metadata. It stores the records in one XML file per paper, for a total of about 1.4 million files. These files are too large to be uploaded here.

Step 1: Process XML files
In the Jupyter notebook 1-process-xml-files.ipynb, the individual XML files are processed into a single large Pandas DataFrame, which is stored in TSV and pickle formats. These files are too large to be uploaded here.

Step 2: Process categories and output to per_year and per_category TSVs
In the Jupyter notebook 2-process-categories-out.ipynb, the large TSV file created in step 1 is parsed and separated into two different batched outputs. The processed_data/per_year folder contains one TSV file per year, compressed in .zip format. The processed_data/per_category folder contains one TSV file per arXiv category, compressed in .xz format. arXiv papers have primary and secondary categories (posting and cross-posting), and a paper is in a category's dataset if it was either posted or cross-posted to that category.

Step 3: Export raw titles and abstracts
In the Jupyter notebook 3-abstracts-export.ipynb, the per_year datasets are unpacked and merged, then two sets of files are created: 1) just abstracts and 2) just titles, with one title or abstract per line. This creates zipped files for all items (too large to upload on GitHub) and for a random sample of 250k items. The sample files can be found in processed_data/DUMP_DATE/arxiv-abstracts-250k.txt.zip and processed_data/DUMP_DATE/arxiv-titles-250k.txt.zip.
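The membership rule from Step 2 (a paper belongs to every category it was posted or cross-posted to) can be illustrated with a minimal sketch. This is not the code from 2-process-categories-out.ipynb; the function name group_by_category is hypothetical, the second sample row is invented for illustration, and the only assumption taken from the data is that categories are stored as a comma-separated string.

```python
from collections import defaultdict


def group_by_category(rows):
    """Map each category to the IDs of papers posted or cross-posted to it.

    `rows` is an iterable of dicts with at least `arxiv_id` and `categories`
    keys, where `categories` is a comma-separated string.
    """
    by_category = defaultdict(list)
    for row in rows:
        for category in row["categories"].split(","):
            category = category.strip()
            if category:  # skip empty fragments such as a trailing comma
                by_category[category].append(row["arxiv_id"])
    return dict(by_category)


rows = [
    # Values taken from the example row in the data dictionary.
    {"arxiv_id": "1011.0240", "categories": "hep-th,astro-ph.CO,gr-qc"},
    # Invented row, for illustration only.
    {"arxiv_id": "hypothetical-0001", "categories": "gr-qc"},
]

membership = group_by_category(rows)
for category in sorted(membership):
    print(category, membership[category])
```

Because a cross-posted paper appears under every one of its categories, concatenating several per-category files will generally contain duplicate papers; deduplicating on arxiv_id is one way to handle that.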

Example usage Jupyter notebook
In the Jupyter notebook 4-analysis-examples.ipynb, the per_year datasets are unpacked and merged into one large dataframe, which is then analyzed in various ways. If you want to use this data to analyze the entire arXiv, this notebook may be a useful starting point.

Data dictionary for full metadata files
These files are in processed_data/DUMP_DATE/per_year/YEAR.tsv.zip and processed_data/DUMP_DATE/per_category/CATEGORY_NAME.tsv.zip, with one row per line and tab-separated.

| Variable name | Definition | Example |
| abstract | Text of the abstract, may include LaTeX formatting | We find the natural embedding of the (R+R^2)-i... |
| acm_class | ACM Classification (manually entered by authors, if exists) | |
| arxiv_id | arXiv internal ID. The PDF URL is "https://arxiv.org/pdf/" + arxiv_id + ".pdf" | 1011.0240 |
| author_text | Comma-separated list of authors | Sergei V. Ketov, Alexei A. Starobinsky |
| categories | Comma-separated list of all categories the paper was submitted to (posted and cross-posted) | hep-th,astro-ph.CO,gr-qc |
| comments | Author comments | 4 pages, revtex, no figures (very minor additi... |
| created | Date created (YYYY-MM-DD) | 2010-10-31 |
| doi | DOI (manually entered by authors, if exists) | 10.1103/PhysRevD.83.063512 |
| num_authors | Number of authors | 2 |
| num_categories | Number of categories | 3 |
| primary_cat | Primary category the paper was submitted to | hep-th |
| title | Paper title, may include LaTeX | Embedding (R+R^2)-Inflation into Supergravity |
| updated | Date last updated (YYYY-MM-DD) | 2011-02-28 |
| created_ym | Year and month created | 2010-10 |
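As a rough sketch of how one of these files might be read, the example below opens a zipped TSV with the Python standard library and builds the PDF URL from arxiv_id as described in the data dictionary. The helper names read_tsv_zip and pdf_url are hypothetical, and the tiny file written to a temporary directory is a stand-in for a real YEAR.tsv.zip; it assumes the archive contains a single UTF-8 TSV with a header row.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def read_tsv_zip(path):
    """Return the rows of the first file inside a zipped TSV as dicts."""
    with zipfile.ZipFile(path) as archive:
        member = archive.namelist()[0]  # assumes one TSV per archive
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


def pdf_url(arxiv_id):
    """Build the PDF URL from an arXiv ID, per the data dictionary."""
    return "https://arxiv.org/pdf/" + arxiv_id + ".pdf"


# Stand-in for processed_data/DUMP_DATE/per_year/YEAR.tsv.zip,
# using values from the example row in the data dictionary.
sample_tsv = (
    "arxiv_id\tprimary_cat\tcreated\tnum_authors\n"
    "1011.0240\thep-th\t2010-10-31\t2\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "2010.tsv.zip"
    with zipfile.ZipFile(sample_path, "w") as archive:
        archive.writestr("2010.tsv", sample_tsv)
    rows = read_tsv_zip(sample_path)

first = rows[0]
print(pdf_url(first["arxiv_id"]))  # https://arxiv.org/pdf/1011.0240.pdf
```

The csv module returns every value as a string, so numeric columns such as num_authors need explicit conversion. With pandas, which the notebooks use, pandas.read_csv(path, sep="\t") infers zip compression from the file extension; passing dtype={"arxiv_id": str} keeps IDs such as 1011.0240 from being parsed as numbers.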

This is a full archive of metadata about papers on arxiv.org from 1993 to 2018, including abstracts. The data is tidy and packed in TSV files, in two different collections of the total dataset: per year (all categories) and per primary category (all years). This archive also includes Jupyter notebooks for unpacking and analyzing the data in Python. See the README.md file and https://github.com/staeiou/arxiv_archive for more information.

Impact by BIP!
  • Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
  • Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Usage by OpenAIRE UsageCounts
  • Views: 20