
Dataset for the paper "Software Documentation References to Unmaintained Repositories: An Empirical Study", submitted to ICSME 2026.

Dataset 1: GitHub repositories referenced in websites. This dataset contains the 51,265 distinct GitHub repositories referenced in the analyzed websites.

Dataset 2: Websites referencing GitHub repositories. This dataset contains 2,070 websites, each containing at least one reference to a GitHub repository. At the median, the 2,070 websites' repositories have 17.8K stars, 191 contributors, 121K LOC, 1.6K issues, 3.8K commits, and 69 releases.

Dataset 3: Software websites referencing GitHub repositories. This dataset contains the websites of the top 100 most starred repositories that are real software systems and have at least 10 references to GitHub repositories. The threshold of 10 was adopted to filter out websites that reference GitHub repositories only rarely, which are outside the scope of this analysis. At the median, the 100 websites' repositories have 68.3K stars, 394 contributors, 400K LOC, 713 issues, 14.8K commits, and 190 releases. The three most starred are React, TensorFlow, and Microsoft VSCode.

Dataset 4: Software documentation referencing GitHub repositories. Starting from Dataset 3, we manually inspected all webpages and selected those related to software documentation, such as learning guides, tutorials, and API references. This filtering step removed noisy pages, including translations, outdated documentation, demos, and datasets. The manual assessment yielded 1,351 webpages containing software documentation.

repos-ghs: 2,617 repositories with websites, from the SEART GitHub Search Engine (seart-ghs).
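
The selection criteria described for Dataset 3 (real software systems, at least 10 references to GitHub repositories, top 100 by stars) can be illustrated with a minimal Python sketch. It is not the authors' script: the `Website` record, its field names, and the function `select_software_websites` are hypothetical, and the order of operations (filter first, then take the most starred) is an assumption about how the criteria combine.

```python
from dataclasses import dataclass


@dataclass
class Website:
    """Hypothetical record; field names are assumptions, not the dataset's schema."""
    url: str
    stars: int          # stars of the repository that owns the website
    github_refs: int    # number of references to GitHub repositories on the site
    is_software: bool   # whether the repository is a real software system


def select_software_websites(websites, min_refs=10, top_n=100):
    """Keep software websites with at least `min_refs` GitHub references,
    then return the `top_n` of them ranked by stars (descending)."""
    eligible = [
        w for w in websites
        if w.is_software and w.github_refs >= min_refs
    ]
    eligible.sort(key=lambda w: w.stars, reverse=True)
    return eligible[:top_n]
```

With the defaults `min_refs=10` and `top_n=100`, the function mirrors the thresholds stated above; both are parameters so that other cut-offs can be explored.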
