Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

João Felipe; Leonardo; Vanessa; Juliana

ZENODO

Dataset . 2019

License: CC BY

Data sources: Datacite

Research data, Dataset. 13 Mar 2019. English. Publisher: Zenodo

doi: 10.5281/zenodo.2592524

Abstract

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior and encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

- dump.tar.bz2
- jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

- analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
- archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
- paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.

In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analysis notebooks. In the analysis, we used the following environment:

- Ubuntu 18.04.1 LTS
- PostgreSQL 10.6
- Conda 4.5.11
- Python 3.7.2
- PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

It populates the database with the dump.
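If the restore does not find the expected file, it may help to confirm that the downloaded archive actually contains a dump. The following is a minimal sketch, not part of the original package: the function names dump_members and contains_dump are hypothetical, and it uses only Python's standard tarfile module. Listing members reads through the whole compressed archive, so it can take a while on a large file.

```python
import tarfile


def dump_members(archive_path):
    """Return the names of regular files in a bzip2-compressed tar archive."""
    # "r:bz2" opens the archive for reading with bzip2 decompression,
    # the same format that `tar -xjf` handles.
    with tarfile.open(archive_path, "r:bz2") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def contains_dump(archive_path, expected_suffix=".dump"):
    """Check whether any file in the archive ends with the expected suffix."""
    # The ".dump" suffix is taken from the file name db2019-03-13.dump
    # mentioned in the text; other archives may use a different suffix.
    return any(name.endswith(expected_suffix) for name in dump_members(archive_path))
```

For example, contains_dump("dump.tar.bz2") should return True when the archive holds a file such as db2019-03-13.dump.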
Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

Go to the analyses folder and install all the dependencies listed in requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

Execute the notebooks in this order:

- Index.ipynb
- N0.Repository.ipynb
- N1.Skip.Notebook.ipynb
- N2.Notebook.ipynb
- N3.Cell.ipynb
- N4.Features.ipynb
- N5.Modules.ipynb
- N6.AST.ipynb
- N7.Name.ipynb
- N8.Execution.ipynb
- N9.Cell.Execution.Order.ipynb
- N10.Markdown.ipynb
- N11.Repository.With.Notebook.Restriction.ipynb
- N12.To.Paper.ipynb

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements

This time, we have extra requirements:

- All the analysis requirements
- lbzip2 2.5
- gcc 7.3.0
- GitHub account
- Gmail account

Environment

First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction

    # Frequency of log report
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";

    export JUP_DB_IP="localhost"; # postgres database IP

Then, configure the file ~/oauth2_creds.json, according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package. (Note that it is a local package that has not been published to PyPI. Make sure to use the -e option.)

Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Anaconda 3.7

When we executed the experiments, the anaconda package for Python 3.7 was not complete. So, we attempted to install all Anaconda 3.x dependencies manually:

    conda create -n py37 python=3.7 anaconda -y
    conda activate py37
    conda install -y _ipyw_jlab_nb_ext_conf alabaster anaconda-client anaconda-navigator anaconda-project appdirs asn1crypto astroid astropy atomicwrites attrs automat
    conda install -y babel backports backports.shutil_get_terminal_size beautifulsoup4 bitarray bkcharts blaze blosc bokeh boto bottleneck bzip2
    conda install -y cairo colorama constantly contextlib2 curl cycler cython
    conda install -y defusedxml docutils et_xmlfile fastcache filelock fribidi
    conda install -y get_terminal_size gevent glob2 gmpy2 graphite2 greenlet
    conda install -y harfbuzz html5lib hyperlink imageio imagesize incremental isort
    conda install -y jbig jdcal jeepney jupyter jupyter_console jupyterlab_launcher keyring kiwisolver
    conda install -y libtool libxslt lxml matplotlib mccabe mkl-service mpmath navigator-updater
    conda install -y nltk nose numpydoc openpyxl pango patchelf path.py pathlib2 patsy pep8 pkginfo ply pyasn1 pyasn1-modules pycodestyle pycosat pycrypto pycurl pyflakes pylint pyodbc pywavelets
    conda install -y rope scikit-image scikit-learn seaborn service_identity singledispatch spyder spyder-kernels statsmodels sympy
    conda install -y tqdm traitlets twisted unicodecsv xlrd xlsxwriter xlwt zope zope.interface
    conda install -y sortedcollections typed-ast
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

Stopwords

Use nltk to download stopwords:

    conda activate py36
    python -c "import nltk; nltk.download('stopwords')"

Everything should now be set to run.

Executing

In this step, we recommend using the py36 environment to orchestrate the execution. We designed the scripts for Python 3.6, and if they are correctly configured, they can invoke the other environments.

    conda activate py36

If you want to extend the execution to more environments, configure the environments in the file archaeology/config.py.
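Because the collection runs for a long time, it may be worth confirming that the variables from the Environment section are set before launching anything. The following is a minimal sketch and not part of the archaeology package: the helper missing_variables and the list EXPECTED_VARIABLES are hypothetical, and the list is only an illustrative subset. Variables that the text says to leave blank (JUP_NOTEBOOK_INTERVAL and JUP_REPOSITORY_INTERVAL) are deliberately excluded.

```python
import os

# Illustrative subset of the variables from the Environment section.
# Which ones the archaeology scripts strictly require is an assumption here.
EXPECTED_VARIABLES = [
    "JUP_MACHINE",
    "JUP_BASE_DIR",
    "JUP_LOGS_DIR",
    "JUP_DB_CONNECTION",
    "JUP_GITHUB_USERNAME",
    "JUP_GITHUB_PASSWORD",
]


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    if environ is None:
        environ = os.environ
    # environ.get returns None for unset names; empty strings are also falsy.
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables(EXPECTED_VARIABLES)
    if missing:
        print("Unset or empty variables: " + ", ".join(missing))
    else:
        print("All expected variables are set.")
```

Passing the environment as a parameter keeps the check easy to try against a plain dictionary before relying on the real shell environment.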
For querying and downloading repositories from GitHub, run in the jupyter_reproducibility/archaeology directory:

    python s0_repository_crawler.py

For extracting data from the repositories and notebooks, run in this order:

    python s1_notebooks_and_cells.py
    python s2_requirement_files.py
    python s3_compress.py
    python s4_markdown_features.py
    python s5_extract_files.py
    python s6_cell_features.py
    python s7_execute_repositories.py
    python p0_local_possibility.py
    python p1_notebooks_and_cells.py
    python p2_sha1_exercises.py

Alternatively, execute the following script, which orchestrates all the executions and notifies when they finish:

    python main_with_crawler.py

If some script fails to process all repositories/notebooks/cells, use the option "-e" to rerun it and force the re-extraction.

After this process, refer to the "Reproducing the Analysis" section for analyzing the collected data.

Changelog

- 2019/01/14 - Version 1 - Initial version
- 2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate rate of failure for each reason
- 2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider it and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.

Keywords

github, jupyter notebook, reproducibility
