publication . Other literature type . Preprint . External research report . 2020

A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris
Open Access English
  • Published: 21 Apr 2020
  • Publisher: Zenodo
Abstract
Comment: 33 pages, 33 figures, 17 listings
Subjects
free text keywords: deduplication, fork, project clone, GitHub, dataset, Computer Science - Software Engineering
Funded by
EC| FASTEN
Project
FASTEN
Fine-Grained Analysis of Software Ecosystems as Networks
  • Funder: European Commission (EC)
  • Project Code: 825328
  • Funding stream: H2020 | IA
