Notably Inaccessible – Data Driven Understanding of Data Science Notebook (In)Accessibility

Potluri, Venkatesh; Singanamalla, Sudheesh; Tieanklin, Nussara; Mankoff, Jennifer

ZENODO

Dataset . 2023

License: CC BY

Data sources: Datacite

Research data: Dataset. Published: 25 Jul 2023. Language: English. Publisher: Zenodo. Funded by: NSF | Using Passive Sensing to ...

Authors: Potluri, Venkatesh; Singanamalla, Sudheesh; Tieanklin, Nussara; Mankoff, Jennifer

doi: 10.5281/zenodo.8185050, 10.5281/zenodo.8185049

Abstract

Overview

This dataset artifact contains the intermediate datasets from the pipeline executions necessary to reproduce the results of the paper. We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate and analyse these datasets are shared in the GitHub repository for this work.

The dataset contains large files of approximately 60 GB, so please exercise caution when extracting the data from compressed files. Some of the files could take a significant amount of script run time to generate or reproduce.

Dataset Contents

We briefly summarize the files included in our dataset. Please refer to the documentation for specific information about the structure of the data in these files, the scripts to generate them, and runtimes for various parts of our data processing pipeline.

- epoch_9_loss_0.04706_testAcc_0.96867_X_resnext101_docSeg.pth: We share this model file, originally provided by Jobin et al., to enable the classification of figures found in our dataset. Please place this into the `model/` directory.
- model-results.csv: This file contains results from the classification performed on the figures found in the notebooks in our dataset. Performing this classification may take up to a day.
- a11y-scan-dataset.zip: This archive contains two files and results in datasets of approximately 60 GB when extracted. Please ensure that you have sufficient disk space to uncompress this zip archive. The archive contains:
  - a11y/a11y-detailed-result.csv: This dataset contains the accessibility scan results from the scans run on the 100k notebooks across themes. The detailed result file can be very large (> 60 GB) and can be time-consuming to construct.
  - a11y/a11y-aggregate-scan.csv: This file is an aggregate of the detailed result that contains the number of each type of error found in each notebook. This file is also shared outside the compressed directory.
- errors-different-counts-a11y-analyze-errors-summary.csv: This file contains the counts of errors that occur in notebooks across different themes.
- nb_processed_cell_html.csv: This file contains metadata corresponding to each cell extracted from the HTML exports of our notebooks.
- nb_first_interactive_cell.csv: This file contains the metadata necessary to compute the first interactive element, as defined in our paper, in each notebook.
- nb_processed.csv: This file contains the data obtained after processing the notebooks and extracting the number of images, imports, languages, and cell-level information.
- processed_function_calls.csv: This file contains information about the notebooks and the various imports and function calls used within them.
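The disk-space caution for a11y-scan-dataset.zip can be turned into a small pre-flight check. The following is a minimal sketch using only the Python standard library and is not part of the authors' scripts: the helper names `uncompressed_size`, `safe_extract`, and `count_rows`, the 10% headroom, and the `data/` destination in the usage note are all assumptions made for illustration. It reads the uncompressed sizes recorded in the archive, compares them with the free space at the destination, and streams a CSV row by row so that a very large file is never loaded into memory at once.

```python
import csv
import shutil
import zipfile
from pathlib import Path


def uncompressed_size(zip_path):
    """Sum the uncompressed sizes recorded in the archive's directory."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())


def safe_extract(zip_path, dest_dir, headroom=1.1):
    """Extract only if dest_dir has room for the contents plus a margin.

    headroom is an illustrative safety factor (assumption, not from the
    dataset documentation). Returns the number of bytes that were required.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    needed = int(uncompressed_size(zip_path) * headroom)
    free = shutil.disk_usage(dest).free
    if free < needed:
        raise OSError(f"need about {needed} bytes, only {free} free in {dest}")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return needed


def count_rows(csv_path):
    """Count data rows by streaming, assuming the first row is a header."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)
```

For example, `safe_extract("a11y-scan-dataset.zip", "data/")` would raise `OSError` before writing anything if the target volume is too small, and `count_rows("data/a11y/a11y-aggregate-scan.csv")` would then give a quick sanity check on the extracted file. The column layout of the CSV files is not assumed here; consult the dataset documentation for it.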

References

Jobin, K.V., Mondal, A. and Jawahar, C.V., 2019, September. Docfigure: A dataset for scientific document figure classification. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 1, pp. 74-79). IEEE.

Related Organizations

University of Mary Washington
United States
University of Mary
United States

Keywords

Data Science, Computational Notebooks, Accessibility

Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.