GPT-J 6B Smart Contract Vulnerabilities Model

Artifact Description

This model is a version of the gpt-j-6B-smart-contract model fine-tuned on the vulnerable_smart_contracts dataset. The checkpoint is 24.3 GB in total, split into two shards of around 12 GB each. It was trained with the Transformers library and is available in PyTorch format.

Environment Setup

The Transformers library from HuggingFace is required to load the model. Depending on your system, you might need to install PyTorch from source; see the PyTorch installation instructions. Both Unix-based and Windows systems are supported.

Loading the model in float32 precision requires RAM of at least twice the model size: one copy for the initial weights and another to load the checkpoint. Loading alone therefore takes at least 48 GB of RAM. Inference on a GPU needs around 40 GB of GPU memory to hold the model. Training or fine-tuning the model requires significantly more GPU memory.

Getting Started

The following code snippets demonstrate how to run inference with the model using the Transformers library from HuggingFace.

First, the tokenizer and model are loaded into memory. The path supplied to the tokenizer and to the model must be a valid directory containing a config.json file, namely the directory extracted from the downloaded "model.zip" file. Once the model is in RAM, it is moved onto the GPU if a CUDA GPU is available.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer/dir")
    tokenizer.pad_token = tokenizer.eos_token

    # Load model
    model = AutoModelForCausalLM.from_pretrained("path/to/model/dir").to(device)
    print("Model loaded")

To activate the vulnerability-constrained decoding, the `NoBadWordsLogitsProcessor` logits processor in the Transformers library can be used by defining a list of lists of token IDs that are not allowed to be generated.
These must be the token IDs of the vulnerability tokens we want to avoid.

    bad_words = ''.join(['<UpS>', '<TO>', '<IOU>', '<DC>', '<UcC>', '<RE>', '<FE>', '<NC>', '<TD>', '<TOD>'])
    bad_word_ids = tokenizer(bad_words).input_ids
    # Wrap each token ID in its own list, the format generate() expects
    bad_word_ids_list = [[token_id] for token_id in bad_word_ids]

Then, some sample smart contract code is encoded with the initialized tokenizer and placed on the GPU (if available).

    prompt = """// SPDX-License-Identifier: GPL-3.0
    pragma solidity >= 0.7.0;

    contract Coin {
        // Sends an amount of newly created coins to an address
        // Can only be called by the contract creator
        function mint(address receiver, uint amount) public {
            require(msg.sender == minter);
            require(amount < 1e60);
            balances[receiver] += amount;
        }

        // Sends an amount of existing coins
        // from any caller to an address"""

    # Tokenize
    encodings = tokenizer(prompt, padding=True, return_tensors="pt").to(device)

Finally, the encoded text is fed to the model as input, along with `bad_word_ids_list`. This prevents the model from emitting the vulnerability tokens, steering it toward secure code for the smart contract sample. When generation finishes, the output is decoded with the tokenizer and printed.

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **encodings,
            max_length=256,
            pad_token_id=tokenizer.eos_token_id,
            bad_words_ids=bad_word_ids_list,
        )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    print(generated_text)

To deactivate the vulnerability-constrained decoding, omit the `bad_words_ids` parameter from the call to `generate`.
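
The banned-token list only works as intended if each vulnerability tag maps to exactly one token ID; if a tag were split into several pieces, each piece would be banned on its own. The following is a minimal sketch of a check for this, not part of the original artifact. The helper `build_bad_words_ids`, the stand-in encoder `toy_encode`, and the token IDs it returns are hypothetical and exist only so the example runs without the checkpoint.

```python
from typing import Callable, List, Sequence

# Vulnerability tags taken from the snippet above.
VULNERABILITY_TAGS = ['<UpS>', '<TO>', '<IOU>', '<DC>', '<UcC>',
                      '<RE>', '<FE>', '<NC>', '<TD>', '<TOD>']


def build_bad_words_ids(encode: Callable[[str], Sequence[int]],
                        tags: Sequence[str]) -> List[List[int]]:
    """Encode each tag on its own and require exactly one token ID per tag."""
    bad_words_ids = []
    for tag in tags:
        ids = list(encode(tag))
        if len(ids) != 1:
            raise ValueError(f"{tag!r} maps to {len(ids)} tokens, expected 1")
        bad_words_ids.append(ids)
    return bad_words_ids


# Hypothetical stand-in encoder so the sketch runs without the model files.
# The IDs are made up; the real ones come from the model's tokenizer.
_TOY_VOCAB = {tag: 50257 + i for i, tag in enumerate(VULNERABILITY_TAGS)}


def toy_encode(text: str) -> List[int]:
    # Known tags map to a single ID; any other text yields one ID per character.
    if text in _TOY_VOCAB:
        return [_TOY_VOCAB[text]]
    return [ord(ch) for ch in text]


bad_word_ids_list = build_bad_words_ids(toy_encode, VULNERABILITY_TAGS)
print(bad_word_ids_list[:3])
```

With the real tokenizer loaded, the encoder could be `lambda tag: tokenizer(tag, add_special_tokens=False).input_ids`, and the returned list can be passed as `bad_words_ids` to `generate`.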

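The memory figures under Environment Setup follow from simple arithmetic, sketched below. The 24.3 GB checkpoint size and the factor of two for float32 loading come from the description; the nominal parameter count of 6 billion is an assumption read off the model name, and the function names are illustrative only.

```python
def weights_size_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def load_ram_gb(checkpoint_gb: float, copies: int = 2) -> float:
    """RAM needed when the weights are held `copies` times while loading."""
    return checkpoint_gb * copies


CHECKPOINT_GB = 24.3   # total checkpoint size stated in the description
NOMINAL_PARAMS = 6e9   # assumption: "6B" taken from the model name
FLOAT32_BYTES = 4      # float32 stores each parameter in 4 bytes

approx_weights = weights_size_gb(NOMINAL_PARAMS, FLOAT32_BYTES)
ram_needed = load_ram_gb(CHECKPOINT_GB)

print(f"approximate float32 weights: {approx_weights:.1f} GB")
print(f"RAM to load the checkpoint:  {ram_needed:.1f} GB")
```

The nominal estimate lands close to the stated checkpoint size, and doubling the checkpoint size gives the "at least 48 GB" figure quoted above.
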
Related Organizations

Norwegian University of Science and Technology
Norway

Keywords

Smart Contracts, Transformer, Smart Contract, Code Generation
