
arXiv: 2205.09379
Abstract

Context: GitHub is the world's most prominent host of source code, with more than 327M repositories. However, most of these repositories are unlabelled or inadequately labelled, making it harder for users to find relevant projects. Various approaches to software application domain classification have been proposed over the past years. However, several of those approaches suffer from multiple issues, called antipatterns of software classification, that reduce their usability.

Objective: In this paper, we propose a new taxonomy in the GitHub ecosystem, called GitRanking, starting from a well-structured data set composed of curated repositories annotated with topics. The main objective is to create a baseline methodology for software classification that is expandable, hierarchical, grounded in a knowledge base, and free of antipatterns.

Method: We collected 121K topics from GitHub and used GitRanking to create a taxonomy of 301 ranked application domains. GitRanking (1) uses active sampling to ensure a minimal number of annotations to create the ranking; and (2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Furthermore, we adopt the conceived taxonomy in a classification task by considering a state-of-the-art classifier.

Results: Our results show that GitRanking can effectively rank terms in a hierarchy according to how general or specific their meaning is. Furthermore, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked, with a minimum number of annotations (). Concerning the classification task, we show that the model achieves an F1-score of 34%, with a precision of 54%.

Conclusion: This paper is the first collective attempt at building a ground-up taxonomy of software domains. Our vision is that our taxonomy, and its extensibility, can be used to label software projects better and more precisely.
Keywords: GitHub, taxonomy, software classification, active sampling, Software Engineering (cs.SE), Machine Learning (cs.LG), Information Retrieval (cs.IR), FOS: Computer and information sciences
| Indicator | Description | Value |
| selected citations | Citations derived from selected sources; an alternative to the "influence" indicator. | 6 |
| popularity | The "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| influence | The overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | The initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
