Workflow analysis of data science code in public GitHub repositories

Publication: Article, Other literature type · 19 Nov 2022 · Embargo end date: 01 Jan 2022 · Switzerland · English

Publisher: Springer Science and Business Media LLC

Journal: Empirical Software Engineering, volume 28 (ISSN: 1382-3256, eISSN: 1573-7616)

Funded by: SNSF | PIG DATA: Health Analytic..., SNSF | CrowdAlytics: Large-Scale..., SNSF | Data-driven Contemporary ...

Authors: Ramasamy, Dhivyabharathi; Sarasua, Cristina; Bacchelli, Alberto; Bernstein, Abraham

doi: 10.1007/s10664-022-10229-z, 10.5281/zenodo.7109939, 10.5281/zenodo.7109917, 10.5167/uzh-223404, 10.5281/zenodo.5635476, 10.5281/zenodo.5635475

pmid: 36420321

pmc: PMC9675706

Abstract

Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigate the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
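
The first-order Markov chain analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the step labels and the two example notebooks below are hypothetical, and the sketch assumes exactly one label per code cell, whereas the abstract notes that a cell may carry more than one step. Each transition probability is estimated as the number of times step A is directly followed by step B, divided by the number of transitions leaving step A.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is an iterable of label sequences, one per notebook,
    where each label is the (single, assumed) step of one code cell.
    Returns a nested dict: probs[current_step][next_step] -> probability.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each pair of consecutive cells (current step, next step).
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1

    probs = {}
    for current, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalise outgoing counts so that each row sums to 1.
        probs[current] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


# Hypothetical per-cell labels for two notebooks (illustrative names only).
notebooks = [
    ["load_data", "preprocess", "explore", "preprocess", "model"],
    ["load_data", "explore", "preprocess", "model", "evaluate"],
]

probs = transition_probabilities(notebooks)
for step, row in probs.items():
    print(step, row)
```

In this toy input, a transition from "explore" back to "preprocess" is the kind of backward move that would indicate an iterative workflow; steps that never appear as a source (here, "evaluate") simply have no outgoing row.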

Countries

Switzerland

Related Organizations

University of Zurich
Switzerland

Keywords

10009 Department of Informatics, 11476 Digital Society Initiative, 000 Computer science, knowledge & systems, Article

Related research

A Reproducibility Study on 'Workflow Analysis of Data Science Code in Public GitHub Repositories' by Ramasamy et al. (2023, IsReviewedBy)
DASWOW Jupyter Notebooks Subset (2025, IsSupplementedBy)

Impact by BIP!

Selected citations: 16 (derived from selected sources; an alternative to the "Influence" indicator)
Popularity: Top 10% (the "current" impact or attention of the article in the research community at large, based on the underlying citation network)
Influence: Top 10% (the overall impact of the article in the research community at large, based on the underlying citation network, diachronically)
Impulse: Top 10% (the initial momentum of the article directly after its publication, based on the underlying citation network)