
This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When you download them, they should therefore be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.

2024-10-25 update: updated the repository list and the observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed the valid_yaml flag, which was sometimes incorrect.

The dataset was created as follows. First, we used GitHub SEART (on October 7th, 2024) to get a list of all non-fork repositories created before January 1st, 2024, having at least 300 commits and at least 100 stars, in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.) We checked whether a .github/workflows folder existed, filtered out the repositories that did not contain this folder, and pulled the others (between the 9th and 10th of October 2024). We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.) We concatenated every file in /ourDataFolder/output into a CSV file (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it. We added the column uid via a script available on GitHub.
Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).

Using the extracted data, the following files were created:
- workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
- workflows_auxiliaries.tar.gz is a similar file that also contains auxiliary files.
- workflows.csv.gz contains the metadata for the extracted workflow files.
- workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.
- repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.

The metadata for the extracted files is organized into the following columns:
- repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name.
- commit_hash: The commit hash returned by git.
- author_name: The name of the author who changed this file.
- author_email: The email of the author who changed this file.
- committer_name: The name of the committer.
- committer_email: The email of the committer.
- committed_date: The committed date of the commit.
- authored_date: The authored date of the commit.
- file_path: The path to this file in the repository.
- previous_file_path: The path to this file before it was touched.
- file_hash: The name of the related workflow file in the dataset.
- previous_file_hash: The name of the related workflow file in the dataset, before it was touched.
- git_change_type: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is.
- valid_yaml: A boolean indicating whether the file is a valid YAML file.
- probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
- valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
- uid: A unique identifier for a given file that survives modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.

Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
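
Because downloaded files may arrive as x.gz.gz while the provided MD5 refers to the original x.gz, it can help to strip the extra layer before verifying the checksum. The following is a minimal sketch under stated assumptions, not part of the replication package: the helper names recover_original and md5_hex are hypothetical, the sample bytes stand in for a real download, and the whole file is read into memory (a streaming approach would suit the larger archives better).

```python
import gzip
import hashlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def recover_original(downloaded: bytes) -> bytes:
    """Strip one outer gzip layer, but only if the payload is itself gzip."""
    if downloaded[:2] != GZIP_MAGIC:
        return downloaded  # not gzip at all: leave untouched
    inner = gzip.decompress(downloaded)
    # If the payload is gzip too, the download was double compressed.
    return inner if inner[:2] == GZIP_MAGIC else downloaded


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, for comparison with the checksum listed for the file."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for a real download (not data from the dataset).
csv_bytes = b"repository,commit_hash\nocto/demo,abc123\n"
original = gzip.compress(csv_bytes)   # plays the role of x.gz
downloaded = gzip.compress(original)  # plays the role of x.gz.gz

recovered = recover_original(downloaded)
checksum_matches = md5_hex(recovered) == md5_hex(original)
```

A file that was compressed only once is returned unchanged, so the helper can be applied to every download regardless of how it was served.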
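
Since uid stays the same across modifications and renames, grouping the metadata rows by uid gives the history of each file. The sketch below illustrates this with the column names described above. It assumes that committed_date can be parsed by datetime.fromisoformat, which the description does not state. The function name histories_by_uid and the sample rows are hypothetical and do not come from the dataset.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def histories_by_uid(text_stream):
    """Group metadata rows by uid and order each group by committed date."""
    histories = defaultdict(list)
    for row in csv.DictReader(text_stream):
        histories[row["uid"]].append(row)
    for rows in histories.values():
        # Assumption: committed_date is an ISO 8601 timestamp.
        rows.sort(key=lambda r: datetime.fromisoformat(r["committed_date"]))
    return dict(histories)


# Hypothetical rows using a subset of the documented columns.
SAMPLE = """uid,repository,commit_hash,committed_date,file_path,previous_file_path,git_change_type
u1,octo/demo,c1,2024-01-01T10:00:00+00:00,.github/workflows/ci.yml,,A
u1,octo/demo,c3,2024-03-01T10:00:00+00:00,.github/workflows/build.yml,.github/workflows/ci.yml,R
u1,octo/demo,c2,2024-02-01T10:00:00+00:00,.github/workflows/ci.yml,.github/workflows/ci.yml,M
u2,octo/demo,c2,2024-02-01T10:00:00+00:00,.github/workflows/lint.yml,,A
"""

histories = histories_by_uid(io.StringIO(SAMPLE))
u1_changes = [row["git_change_type"] for row in histories["u1"]]
u1_paths = [row["file_path"] for row in histories["u1"]]
```

With the real metadata, a text stream obtained from gzip.open(path, "rt", newline="") could be passed in place of the in-memory sample.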
CI/CD, Software Engineering, Mining Software Repositories, GitHub Actions, Workflows
