Building and updating software datasets: an empirical assessment

This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment". Data collected The data generated for the study it can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprised three files "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects respectively. Here is a description of the datasets: qualitas: includes code metrics and repository metrics from the projects in the release 20130901r of the Qualitas Corpus. currentSample: includes code metrics and repository metrics from a recent sample collected with our sampling procedure. qualitasUpdated: includes code metrics and repository metrics from an updated version of the Qualitas Corpus applying our maintenance procedure. Plot graphics To plot the results and graphics in the article there is a Jupyter Notebook "Experiment.ipynb". It is initially configured to use the data in "datasets" folder. Replication Kit For replication purposes, the datasets which contains recent projects from Github can be re-generated. To do so, the virtual environment must have installed the dependencies in "requirements.txt" file, add Github's tokens in "./token" file, download the source code scanner Sourcemeter, re-define or leave as is the paths declared in the constants (variables written in caps) in the main method, and finally run "main.py" script. The script comprise 5 steps: Project retrieval from Github: at first the sampling frame with projects complying with a specific quality criteria are retrieved from Github's API. Create samples: with the sampling frame retrieved, the current samples are selected (currentSample and qualitasUpdated). In the case of qualitasUpdated, it is important to have first the "sample.csv" file inside the qualitas folder of the dataset originally created for the study. This file contains the metadata of the projects in Qualitas Corpus. Project download and analysis: when all the samples are selected from the sampling frame (currentSample and qualitasUpdated), the repositories are downloaded and scanned with SourceMeter. In the cases in which the analysis is not possible, the projects are replaced with another one with similar size. Outlier detection: once the datasets are collected, it is necessary to manually look for possible outliers in the code metrics under study. In the notebook "Experiment.ipynb" there are specific sections dedicated for it ("Outlier detection (Section 4.2.2)"). Outlier replacement: when the outliers are detected, in the same notebook there is also a section for outlier replacement ("Replace Outliers") where the outliers' url have to be listed to find the appropriate replacement. If it is required, the metrics from the Qualitas Corpus can also be re-generated. First, it is necessary to download the release 20130901r from its official webpage. Second, decompress the .tar files downloaded. Third, make sure that the compressed files with source code from the projects (.java files) are placed in the "compressed" folder, in some cases it is necessary to read the "QC_README" file in the project's folder. Finally, run the original main script "Generate metrics for the Qualitas Corpus (QC) dataset" part of the code.

Related Organizations

National University of the Northeast
Argentina
Centro Científico Tecnológico - Nordeste
Argentina
National Scientific and Technical Research Council
Argentina

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average