artifact_detection - A tool for NLP tasks on textual bug reports.

artifact_detection automatically classifies text, line by line, into natural language (e.g. English in the contained datasets) and non-natural language portions (e.g. stack traces, code snippets, log outputs, file listings, URLs). This repository contains the Python implementation of a machine learning classifier model and basic scripts for automated training set creation from GitHub issue tickets. It also contains a scikit-learn transformer implementation that wraps pretrained models and is ready to be used as a preprocessing step. The datasets consist of issue tickets and documentation files mined from C++, Java, JavaScript, PHP, and Python projects hosted on GitHub. A detailed discussion of this model can be found in "Detecting non-natural language artifacts for de-noising bug reports" by Hirsch T. and Hofer B. (in review). This project is also available on GitHub: https://github.com/AmadeusBugProject/artifact_detection
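
The line-by-line classification and the transformer-style preprocessing step described above can be illustrated with a minimal sketch. This is not the project's actual API or model: the class name `ArtifactRemover`, the helper `build_line_classifier`, the toy training lines, and the choice of character n-gram TF-IDF features with a linear SVM are all assumptions made for illustration only. See the GitHub repository for the real implementation and pretrained models.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (hypothetical, NOT taken from the project's datasets).
# Label 1 = natural language, label 0 = non-natural-language artifact.
TRAIN_LINES = [
    "The application crashes when I click the save button.",
    "I expected the file to be written to disk.",
    "Steps to reproduce the problem are listed below.",
    "This happens on every start since the last update.",
    "    at com.example.Main.run(Main.java:42)",
    "Traceback (most recent call last):",
    "for (int i = 0; i < n; i++) { sum += a[i]; }",
    "2021-03-01 12:00:01 ERROR [main] Connection refused",
    "https://example.com/path/to/file.log",
    "-rw-r--r-- 1 user user 4096 Jan 1 12:00 config.yml",
]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def build_line_classifier():
    """Train a small line-level classifier (assumed feature/model choice)."""
    model = make_pipeline(
        # Character n-grams pick up punctuation and casing typical of code/logs.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=False),
        LinearSVC(),
    )
    model.fit(TRAIN_LINES, TRAIN_LABELS)
    return model


class ArtifactRemover(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keeps only lines predicted as natural language."""

    def __init__(self, line_classifier):
        self.line_classifier = line_classifier

    def fit(self, X, y=None):
        # The wrapped classifier is already trained, so there is nothing to fit.
        return self

    def transform(self, X):
        cleaned = []
        for document in X:
            # Classify each non-empty line of the document separately.
            lines = [line for line in document.splitlines() if line.strip()]
            if not lines:
                cleaned.append("")
                continue
            labels = self.line_classifier.predict(lines)
            kept = [line for line, label in zip(lines, labels) if label == 1]
            cleaned.append("\n".join(kept))
        return cleaned
```

A usage sketch under the same assumptions: `ArtifactRemover(build_line_classifier()).transform([report_text])` returns a list with one cleaned string per input document, which can then be passed on to a downstream text pipeline. With such a tiny toy training set the predictions are not reliable; the sketch only shows the shape of the approach.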

Related Organizations

Graz University of Technology
Austria

Keywords

bug report, nlp, data cleaning
