CAPYBARA: Decompiled Binary Functions and Related Summaries

This dataset is published as part of the paper "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes the training and evaluation data as well as the raw data.

The data_split folder contains .pickle files listing the test and validation repos; the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.

The processed_data folder contains the processed datasets in .csv format. The columns of each CSV are the `summaries`, the `original documentation`, the `repo`, the `source` and `decompiled` code, the `function name`, and a unique `identifier`. The deduplicated samples are included in separate CSVs.

The training_data folder contains the processed training files. `Source C`, `decompiled`, `demiStripped`, and `stripped` each have their own folder, and each is split into deduplicated and regular datasets. The data is further split into .jsonl files for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGlue as is.

The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are the `repo`, the `location`, the `original` code, the corresponding `decompiled` code, the `function name`, a unique `identifier` key, and the corresponding `documentation` for both the decompiled and stripped functions.

License

Copyright 2022 ##########

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
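
Returning to the folder layout described above, the following is a minimal Python sketch of how the .pickle and .jsonl files might be read, and how the train repos could be derived as "all remaining repos". It is not taken from the paper's code: the helper names (`load_pickle`, `load_jsonl`, `train_repos`) are hypothetical, and the exact file names and the keys inside each .jsonl record are not specified here, so they should be checked against the downloaded files.

```python
import json
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a .pickle file such as those in data_split.

    Unpickling can execute arbitrary code, so only load files you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, one per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def train_repos(all_repos, test_repos, valid_repos):
    """Return the repos that are in neither the test nor the validation split."""
    held_out = set(test_repos) | set(valid_repos)
    return sorted(set(all_repos) - held_out)
```

A call might look like `load_jsonl(Path("training_data") / ... / "train.jsonl")`, where the elided path components and the file name are placeholders for whatever the archive actually contains. Since the split is defined per repository, `train_repos` would typically be fed the full set of values in the `repo` column together with the two lists loaded from data_split.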

Related Organizations

Delft University of Technology
Netherlands
University of California, Davis
United States

Keywords

Code Summarization, Reverse Engineering, Binary
