
Preprocessed Java Code Corpus

Name: Preprocessed Java Code Corpus
Keywords: source code, language model, java, bpe, software engineering

Type: Research data, Dataset
Date: 27 Jan 2020
Publisher: Zenodo
Funded by: UKRI | EPSRC Centre for Doctoral...

Authors: Rafael-Michael Karampatsis; Hlib Babii; Romain Robbes; Charles Sutton; Andrea Janes

DOI: 10.5281/zenodo.3628521, 10.5281/zenodo.3628665, 10.5281/zenodo.3628522


Abstract

A preprocessed code corpus for the Java programming language, used for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code". It contains preprocessed, tokenized files for training, validation, testing, and learning the BPE encoding. BPE-segmented versions of these files are included for three encoding sizes (2000, 5000, and 10000 BPE merge operations), along with the learned BPE encodings. Corresponding versions in which compound identifiers are split on camelCase and snake_case, as in (Allamanis et al., 2015), are also included, together with the subtoken maps.
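As a rough illustration of what splitting compound identifiers on camelCase and snake_case can look like, here is a minimal Python sketch. It is not the preprocessing code used to build the corpus: the function name `split_identifier`, the boundary regex, and the handling of acronyms and digits are all assumptions made for this example, and the corpus's own subtoken maps may follow different conventions.

```python
import re
from typing import List

# Hypothetical boundary pattern (an assumption, not the corpus tooling):
# matches the zero-width positions where a camelCase identifier is cut.
_CAMEL_BOUNDARY = re.compile(
    r"(?<=[a-z0-9])(?=[A-Z])"      # lower/digit followed by upper: fooBar -> foo | Bar
    r"|(?<=[A-Z])(?=[A-Z][a-z])"   # end of an acronym: HTTPServer -> HTTP | Server
)


def split_identifier(identifier: str) -> List[str]:
    """Split one identifier into subtokens on '_' and camelCase boundaries.

    Case is preserved and empty pieces (from leading, trailing, or doubled
    underscores) are dropped. Requires Python 3.7+, where re.split accepts
    patterns that match the empty string.
    """
    subtokens: List[str] = []
    for part in identifier.split("_"):
        if part:
            subtokens.extend(_CAMEL_BOUNDARY.split(part))
    return subtokens
```

Under these assumptions, `split_identifier("getFileName")` yields `["get", "File", "Name"]` and `split_identifier("MAX_BUFFER_SIZE")` yields `["MAX", "BUFFER", "SIZE"]`. A mapping from each original identifier to its subtoken list, built with this function, would be one possible shape for a subtoken map.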



Related research

- Pre-trained neural language models (NLMs) for code (2020, relation: IsSourceOf)
- Preprocessed Python Code Corpus (2020, relation: IsAmongTopNSimilarDocuments)
- Preprocessed C Code Corpus (2020, relation: IsAmongTopNSimilarDocuments)
- ePSIC-DLS/ParticleSpy: v0.5.2 (2020, relation: IsAmongTopNSimilarDocuments)

Impact by BIP!

- Citations: 0. An alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. Reflects the initial momentum of an article directly after its publication, based on the underlying citation network.