This is the replication package of the paper "Eliciting Best Practices for Collaboration with Computational Notebooks". In the following, we describe the contents of each file archived in this repository.

dataset.tar.bz2 contains:
- the dataset of Jupyter notebooks we retrieved from Kaggle (/cscw2021_dataset.tar.bz2);
- the notebook we used to filter the available Kaggle kernels based on our research criteria (Meta_Kaggle_filtering/notebooks/Meta_Kaggle_filtering.ipynb);
- the specific version of the Meta Kaggle dataset (October 27, 2020) that we used to perform the filtering (Meta_Kaggle_filtering/data/MetaKaggle 27-10-2020 (KT version)).

notebook_analysis.tar.bz2 contains:
- cscw2021_db_dump.sql.tar.bz2, a PostgreSQL dump of the database with all the data we extracted from the notebooks;
- notebook_analysis_scripts/, the scripts by Pimentel et al. that we extended to analyze our dataset of notebooks (see the dedicated section in this README).

notebook_linting.tar.bz2 contains the Python modules we developed to check code quality in Jupyter notebooks via pylint.

Best Practices in The Most Upvoted Notebooks.pdf contains a table that summarizes and compares the results of our quantitative analysis for each of the studied notebook samples.

Notebooks Analysis

The scripts in the notebook_analysis/notebook_analysis_scripts folder were developed by Pimentel et al. and shared on Zenodo [1] as the replication package of their article "A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks" [2]. To perform our analysis, we made minor extensions to the original code, mainly because it was designed to retrieve notebooks automatically from GitHub, whereas we needed to analyze a dataset of notebooks stored in a local folder on our machine. To avoid a major refactoring of the original scripts, we resorted to an expedient solution.
We put each Jupyter notebook from our dataset in a distinct directory and initialized that directory as a git repository. The result was a folder comprising 1386 git repositories, one per notebook. Afterward, we made the following additions to the original scripts:

- main_with_crawler_custom.py, to be called instead of main_with_crawler.py. It sequentially invokes the notebook-analysis scripts and saves the results to a PostgreSQL database.
- s0_local_crawler.py, invoked by main_with_crawler_custom.py in place of the original s0_repository_crawler.py to crawl repositories from the local folder that we built rather than from GitHub.

N.B.: the path to the local folder containing the git repositories must be specified in an environment variable called JUP_LOCALHOST_REPO_DIR. To do this, you can use the following command:

export JUP_LOCALHOST_REPO_DIR="path/to/local/directory/containing/repositories"

N.B.: since studying the reproducibility of Jupyter notebooks was out of the scope of our work, our main script main_with_crawler_custom.py skips the execution of s7_execute_repositories.py, the original script responsible for the re-execution of notebooks.

To replicate our experiment, you can still set up your execution environment by following the original guide provided by Pimentel et al. on their Zenodo repository: https://zenodo.org/record/2592524

- If you follow the section "Reproducing the Analysis" of that guide, make sure to create a PostgreSQL database by extracting and using our dump, cscw2021_db_dump.sql.tar.bz2.
- If you follow the section "Reproducing or Expanding the Collection" instead, make sure to replace the last instruction of the guide (python main_with_crawler.py) with python main_with_crawler_custom.py to invoke our main script instead of the original.

N.B.: in notebook_analysis/notebook_analysis_scripts/.env you find a complete list of the environment variables that must be declared in order to execute the scripts.
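The per-notebook repository layout described above can be recreated with a short script. The following is a minimal sketch, not the tooling shipped in this package; the function name and both paths are hypothetical, and each notebook is assumed to sit as a flat .ipynb file in the dataset folder:

```python
import subprocess
from pathlib import Path

def build_repositories(dataset_dir: str, repos_dir: str) -> None:
    """Wrap each notebook found in dataset_dir in its own one-commit git repository."""
    # Identity flags so the commit works even without a global git config.
    identity = ["-c", "user.name=replicator", "-c", "user.email=replicator@example.org"]
    for notebook in sorted(Path(dataset_dir).glob("*.ipynb")):
        repo = Path(repos_dir) / notebook.stem
        repo.mkdir(parents=True, exist_ok=True)
        (repo / notebook.name).write_bytes(notebook.read_bytes())
        subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
        subprocess.run(["git", "add", notebook.name], cwd=repo, check=True)
        subprocess.run(["git", *identity, "commit", "-q", "-m", "Add notebook"],
                       cwd=repo, check=True)
```

The folder passed as repos_dir is then the one to export as JUP_LOCALHOST_REPO_DIR so that s0_local_crawler.py can pick up the repositories.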
Customize the variable values and source the .env file to have all variables properly set up in your bash session.

Notebooks Linting

The folder notebook_linting contains the Python modules that we developed to check code quality in Jupyter notebooks via pylint. To reproduce the analysis, a preliminary step is required: notebooks have to be grouped into separate folders by the Python version in which they are written, because each Python version has a dedicated version of pylint. To perform this operation, you can use the script discern_notebooks.py: it takes notebooks from the dataset folder, inspects their Python version, and assigns them to the correct output folder.

N.B.: before running discern_notebooks.py, make sure to customize it by editing the values of the global variables DATASET_PATH (the path to the folder containing the dataset of Jupyter notebooks) and OUTPUT_FOLDERS_PATH (the path to the folder that will contain the directories of notebooks grouped by language version, e.g., py27/ for notebooks written in Python 2.7, py36/ for notebooks written in Python 3.6, etc.).

Once the notebooks are grouped by Python version, set up the environment for the execution of the main script:

- in config.py, set the global variable DATASETS_BASE_PATH to the path of the folder containing the grouped notebooks (i.e., the same value that you assigned to OUTPUT_FOLDERS_PATH in discern_notebooks.py);
- specify the path to your Anaconda or conda installation in an environment variable called JUP_ANACONDA_PATH; if you have already replicated the first part of our analysis (see the previous section, "Notebooks Analysis") following the instructions provided by Pimentel et al., this environment variable should already be set.
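The grouping step performed by discern_notebooks.py can be approximated as follows. This is a sketch, not the script shipped in the package; the function name is hypothetical, and it assumes each notebook records its Python version in the standard JSON metadata field metadata.language_info.version:

```python
import json
import shutil
from pathlib import Path

def group_by_python_version(dataset_path: str, output_folders_path: str) -> None:
    """Copy each notebook into a pyXY/ folder derived from its recorded Python version."""
    for notebook in sorted(Path(dataset_path).glob("*.ipynb")):
        with open(notebook, encoding="utf-8") as f:
            metadata = json.load(f).get("metadata", {})
        version = metadata.get("language_info", {}).get("version", "")
        major_minor = "".join(version.split(".")[:2])  # e.g. "3.6.4" -> "36"
        if not major_minor:
            continue  # skip notebooks without a recorded Python version
        target = Path(output_folders_path) / f"py{major_minor}"
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(notebook, target / notebook.name)
```

The output layout (py27/, py36/, ...) matches the folder names that the main linting script expects under DATASETS_BASE_PATH.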
If JUP_ANACONDA_PATH is not already set, you can use the following command:

export JUP_ANACONDA_PATH="path/to/your/local/anaconda/installation"

Lastly, install the same conda environments that are required to execute the first part of our analysis (see the previous section, "Notebooks Analysis"), following the instructions provided by Pimentel et al. on the Zenodo repository dedicated to their project: https://zenodo.org/record/2592524. In particular, refer to the section "Reproducing or Expanding the Collection".

N.B.: we did not find any Kaggle notebook written in Python 2.7 or 3.8. Thus, if you are replicating our analysis on the same dataset, you can skip the installation of these conda environments.

Now you can run the main script by issuing the following commands in your shell:

conda activate py36
python main.py

The script main.py returns linting results in .csv format (one .csv file per group of notebooks). The results can then be summarized using the Jupyter notebook notebook_linting/Results analysis.ipynb.

References

[1] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire. (2019). Dataset of "A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks" [Data set]. Zenodo. http://doi.org/10.5281/zenodo.2592524

[2] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, "A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks," 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 2019, pp. 507-517, doi: 10.1109/MSR.2019.00077.