PURE: a Dataset of Public Requirements Documents

Please cite this dataset as: Ferrari, A., Spagnolo, G. O., & Gnesi, S. (2017, September). PURE: A dataset of public requirements documents. In 2017 IEEE 25th International Requirements Engineering Conference (RE) (pp. 502-505). IEEE. https://ieeexplore.ieee.org/abstract/document/8049173

This dataset presents PURE (PUblic REquirements dataset), a dataset of 79 publicly available natural language requirements documents collected from the Web. The dataset includes 34,268 sentences and can be used for natural language processing tasks that are typical in requirements engineering, such as model synthesis, abstraction identification and document structure assessment. It can be further annotated to work as a benchmark for other tasks, such as ambiguity detection, requirements categorisation and identification of equivalent requirements.

In the associated paper, we present the dataset and compare its language with generic English texts, showing the peculiarities of requirements jargon: a restricted vocabulary of domain-specific acronyms and words, and long sentences. We also present the common XML format to which we have manually ported a subset of the documents, with the goal of facilitating replication of NLP experiments. The XML documents are also available for download.

The paper associated with the dataset can be found here: https://ieeexplore.ieee.org/document/8049173/
More info about the dataset is available here: http://nlreqdataset.isti.cnr.it
A preprint of the paper is available at ResearchGate: https://goo.gl/HxJD7X

The dataset includes:
- all the documents in PDF format
- a subset of 19 documents in XML format
- the .xsd schema of the XML files

The dataset has been created by gathering data from web sources and we are not aware of license agreements or intellectual property rights on the requirements.
The curator took the utmost care to minimize the risk of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Alessio Ferrari, alessio.ferrari@cnr.it, alessio.ferrari@ucd.ie) to discuss possible removal of the dataset [see Zenodo's policies].
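
For readers who want to use the XML subset in NLP experiments, the following is a minimal sketch of how the plain text of each XML document might be extracted with the Python standard library. It deliberately assumes nothing about the element names defined in the .xsd schema and simply collects all text nodes; the folder name `pure_xml` and the helper names `extract_text` and `load_corpus` are hypothetical and not part of the dataset.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_text(root: ET.Element) -> str:
    """Concatenate every text node under `root`, collapsing whitespace.

    No tag names are assumed, so this works regardless of the schema,
    at the cost of discarding the document structure.
    """
    raw = " ".join(root.itertext())
    return " ".join(raw.split())


def load_corpus(folder: str) -> dict[str, str]:
    """Map each *.xml file name in `folder` to its extracted plain text."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.xml")):
        corpus[path.name] = extract_text(ET.parse(path).getroot())
    return corpus


if __name__ == "__main__":
    # "pure_xml" is a placeholder for wherever the XML files were unpacked.
    for name, text in load_corpus("pure_xml").items():
        print(f"{name}: {len(text.split())} whitespace-separated tokens")
```

A schema-aware loader that follows the element structure in the .xsd file would preserve sections and requirement identifiers; the approach above is only a starting point for tasks that need raw text.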

Keywords

requirements, requirements documents, natural language requirements, shall requirements, NLP, PURE, public requirements, specification, requirements specifications, software engineering, requirements classification, requirements tracing, traceability, ambiguity detection, defect detection, model generation, model synthesis, software engineering dataset, requirements dataset
