SQuaD: The Software Quality Dataset - Dataset

This is a redirection Zenodo repository that presents "SQuaD: The Software Quality Dataset", submitted to the MSR 2026 Data and Tool Showcase Track, and provides the link to each of the supplementary materials (see below).

Version: 1.0
DOI: https://doi.org/10.5281/zenodo.17566690
Authors: Mikel Robredo, Matteo Esposito, Davide Taibi, Rafael Peñaloza, Valentina Lenarduzzi
Affiliations: University of Oulu, University of Southern Denmark, University of Milano-Bicocca

Access and Usage

The dataset and all supplementary materials are available through the Zenodo and IDA* repositories:

- CSV Raw Data (IDA): https://doi.org/10.23729/fd-c528d131-2c8c-3e61-91f1-a075931e73dc
- MongoDB BSON (IDA): https://doi.org/10.23729/fd-f9dc7d2c-0465-3991-961f-56128ee518d0
- Replication Package (Zenodo): https://doi.org/10.5281/zenodo.17541471

* On IDA: IDA (ida.fairdata.fi) is a research data storage service organized by the Finnish Ministry of Education and Culture and produced by CSC – IT Center for Science. The service is intended for storing stable research data, both raw and processed, that is included in research datasets published in the FAIRdata (FAIR: Findable, Accessible, Interoperable, and Reusable) Etsin service. The service is offered free of charge to users affiliated with Finnish universities, polytechnics, and research institutes.

Each link corresponds to a specific data access format, along with replication scripts and diagrams for database structure.

Main abbreviations:

Static Analysis Tool (SAT): A software static analysis tool is an automated program that examines a software's source code without executing it to find potential bugs, security vulnerabilities, and deviations from coding standards.

Issue Tracking System (ITS): A software issue tracking system is a tool used to manage and track software bugs, feature requests, and other problems from initial report to final resolution.
It acts as a centralized database, allowing teams to create, assign, and monitor issues, ensuring a structured and organized approach to problem-solving and collaboration.

Overview

The Software Quality Dataset (SQuaD) is a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. SQuaD integrates nine state-of-the-art Static Analysis Tools (SATs) and combines both product and process metrics to support large-scale empirical research on software quality, maintainability, evolution, and technical debt.

This dataset was submitted to a major software engineering conference in 2025 and is the result of a seven-month large-scale mining effort.

Dataset Summary

Attribute                 Description
Projects analyzed         450 open-source projects
Releases analyzed         63,586 releases/tags
Static Analysis Tools     9 tools (SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, PyRef)
Unique metrics            725 metrics
Defect tickets            628,178
Commits analyzed          2,622,413
Detected vulnerabilities  1,479 CVEs and 175 CWEs
Average project age       9 years
Average LOC per project   125,500
Average GitHub stars      2,465
Average contributors      104

Data Contents

The dataset includes a variety of entities and metric tables, covering product, process, and vulnerability information. Each entity corresponds to a CSV table or a MongoDB collection:

Table                Description
PROJECTS             GitHub repository metadata
COMMITS              Commit hash, message, date, author alias
ISSUES               Issue tickets from GitHub, Jira, and Bugzilla
RELEASES             Identifiers of project releases and related commit hashes
GITHUB_METRICS       Stars, contributors, watchers, and project statistics
PRJ_ITS_VLN_LINKAGE  Links between projects, issue trackers, and detected vulnerabilities
CVE / CWE            Official vulnerability and weakness data from NIST and MITRE
PROCESS_METRICS      14 process metrics computed for each release
TOOL tables          Output metrics from each SAT at method, class, file, and project levels

Available Formats

SQuaD is distributed in two complementary formats to facilitate different research and analysis needs:

1. CSV Format

- Each entity is provided as a separate CSV file.
- Ideal for direct exploration, statistical analysis, and integration into scripts or notebooks.
- Mirrors the same relational structure as the MongoDB database.

2. MongoDB Format

- A NoSQL version of the dataset is provided as a Zstandard-compressed BSON dump.
- Can be imported into MongoDB for scalable querying and time-aware analyses.
- Recommended for researchers dealing with large-scale data analytics or custom pipelines.

NOTE:
- The full dataset is approximately 1.9 TB, so ensure sufficient storage and RAM before extraction and import.

Step 1: Decompress the Archive (Zstandard)

The dataset is distributed as a .tar.zst file. To extract it, install Zstandard and decompress as follows:

    # Install Zstandard (if not already installed)
    sudo apt install zstd

    # Decompress the archive (this may take several hours)
    unzstd SQuaD_MongoDB_Dump.tar.zst

    # Extract the BSON dump files
    tar -xvf SQuaD_MongoDB_Dump.tar

Step 2: Import into MongoDB

Once decompressed, you can import the collections using mongorestore (part of the MongoDB Database Tools):

    # Example: restore entire database
    mongorestore --db squad_db /path/to/SQuaD_MongoDB_Dump

Methodology Overview

The dataset construction follows four key stages (illustrated in the paper's Figure 1):

1. Mining version control data
   - Cloned 501 repositories (filtered to 450 active, mature projects).
   - Retrieved commits, tags, issues, and metadata from issue tracking systems (ITS) such as GitHub, Jira, and Bugzilla.
2. Mining software quality metrics
   - Applied nine SATs in parallel across all releases.
   - Extracted metrics at multiple granularity levels (method, class, file, project).
3. Extracting vulnerabilities
   - Parsed CVE and CWE references from issue tickets.
   - Fetched official vulnerability descriptions via NIST and MITRE APIs.
4. Collecting process metrics
   - Computed 14 release-level process metrics (e.g., churn, contributor count, commit density) using GitPython.

Research Opportunities

SQuaD provides a comprehensive foundation for a variety of software engineering research domains:

- Software evolution and maintainability analysis
- Defect prediction and Just-In-Time learning
- Technical debt and code smell benchmarking
- Refactoring impact analysis
- Software vulnerability detection and risk assessment
- Transformer-based and AI-driven quality modeling

Its combination of product and process metrics supports both statistical and machine learning–based investigations.

Acknowledgments

This work was supported by:

- CSC – IT Center for Science, Finland (Mahti Supercomputer, Allas Cloud Storage, cPouta services)
- FAST Doctoral Research Network, funded by the Finnish Ministry of Education and Culture
- SciTools, for providing academic support and licenses for Understand
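
Supplementary example (decompression step)

Because the archive is very large, it may help to check free disk space first and to decompress and untar in a single pass, which avoids writing the intermediate .tar file to disk. The following is a minimal sketch, not part of the official instructions. The function names has_free_space and extract_dump, the target directory, and the space threshold are all hypothetical; only the archive name comes from the instructions above.

```shell
#!/bin/sh
# Sketch only: function names, target directory, and the space threshold
# are illustrative assumptions, not part of the SQuaD distribution.

# has_free_space DIR REQUIRED_KB
# Succeeds if the filesystem holding DIR has at least REQUIRED_KB KiB available.
has_free_space() {
    # df -Pk prints one POSIX-format line per filesystem; field 4 is "Available"
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$2" ]
}

# extract_dump ARCHIVE TARGET_DIR REQUIRED_KB
# Checks free space, then streams the decompressed tar directly into tar.
# TARGET_DIR must already exist.
extract_dump() {
    archive=$1
    target=$2
    required_kb=$3
    if ! has_free_space "$target" "$required_kb"; then
        echo "Not enough free space in $target" >&2
        return 1
    fi
    # zstd -dc writes the decompressed stream to stdout; tar reads it from stdin
    zstd -dc "$archive" | tar -xf - -C "$target"
}
```

A call might look like `extract_dump SQuaD_MongoDB_Dump.tar.zst /path/to/target 2000000000`, where 2000000000 KiB (roughly 2 TB) is an illustrative threshold to be adjusted to the actual extracted size.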
