Python Annotated Code Search (PACS) Datasets & Pretrained Models

This upload contains the datasets and pre-trained models used for the paper "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent". The code for easily loading these datasets and models will be made available here: http://github.com/nokia/codesearch

Datasets

There are three types of datasets:

- snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated
- code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test
- training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20

The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora:

- staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, LICENSE.
- conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/, LICENSE.

The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).

Pre-trained models

Each model can embed queries and (annotated) code snippets in the same space. The models are released under a BSD 3-Clause License.

- ncs-embedder-so-ds-feb20
- ncs-embedder-staqc-py
- tnbow-embedder-so-ds-feb20
- use-embedder-pacs
- ensemble-embedder-pacs
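
Because queries and annotated snippets are embedded in the same space, retrieval can be treated as a nearest-neighbour lookup: embed the query, embed each snippet together with its description, and rank by cosine similarity. The sketch below only illustrates that idea. It uses a toy hashed bag-of-words embedder as a stand-in for a real model; `toy_embed`, `search`, `DIM`, and the sample snippets are hypothetical names and data introduced for illustration, and none of them belong to the released models or to the codesearch library.

```python
import math
import re
import zlib
from typing import Dict, List, Tuple

DIM = 64  # hypothetical embedding size for this toy example


def toy_embed(text: str) -> List[float]:
    """Stand-in embedder (NOT one of the released models):
    hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 gives a hash that is stable across Python runs
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(
    query: str,
    snippets: Dict[str, Dict[str, str]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank annotated snippets (description + code) against a query."""
    q = toy_embed(query)
    scored = [
        (sid, cosine(q, toy_embed(s["description"] + " " + s["code"])))
        for sid, s in snippets.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical annotated snippets, shaped like "code + description" pairs.
snippets = {
    "s1": {
        "description": "read a csv file into a dataframe",
        "code": "df = pd.read_csv(path)",
    },
    "s2": {
        "description": "sort a list of dictionaries by key",
        "code": "sorted(items, key=lambda d: d['k'])",
    },
}

for snippet_id, score in search("how to read csv file", snippets):
    print(snippet_id, round(score, 3))
```

A real pre-trained embedder would replace `toy_embed`; the ranking step stays the same as long as queries and snippets share one vector space.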

Keywords

code search, machine learning, software reuse
