- See https://huggingface.co/hululuzhu/solidity-t5
- Hello World example to utilize the trained model
- A hello world example to use this model. Notice that the input text includes:
  - A Solidity version header, e.g. `pragma solidity ^0.5.7`
  - Ancestor class/library info, e.g. public functions and constants from `ParentA`
  - The contract/library/interface declaration header, e.g. `HelloWorld`, ending with `{`

    # !pip install transformers -q
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    DEVICE = 'cuda'  # fall back to 'cpu' if you do not have CUDA
    tokenizer = AutoTokenizer.from_pretrained("hululuzhu/solidity-t5")
    model = T5ForConditionalGeneration.from_pretrained("hululuzhu/solidity-t5").to(DEVICE)

    # Input: version header, ancestor context comment, declaration header ending with "{"
    text = """pragma solidity ^0.5.7;
    // Context: ParentA | Functions: helloA helloB | Constants: constantA
    contract HelloWorld is ParentA {"""
    input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(DEVICE)

    # Need to tune beam/top-k/top-p params to get a good outcome.
    # Note: top_p and top_k only take effect when do_sample=True;
    # as written, this call runs plain beam search.
    generated_ids = model.generate(input_ids, max_length=256, num_beams=5, top_p=0.95, top_k=50)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

    # Expected outcome
    """
    string public constant name = "Hello World";
    ...
    uint256 public constant override returns (uint256) {
    return initialSupply;
    }
    function initialSupply() public view returns (uint256) {
    ...
    """

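The input layout used in the hello world example can also be assembled programmatically. The following is a minimal sketch only: `build_prompt` and its parameter names are hypothetical helpers introduced here for illustration, not part of the model or its training code. It assumes the three parts are joined by newlines and that multiple names are separated by single spaces.

```python
def build_prompt(version, parents, functions, constants, declaration):
    """Assemble model input: version header, ancestor context, declaration header.

    Hypothetical helper; the layout mirrors the hello world example.
    """
    header = f"pragma solidity {version};"
    # Assumption: names within each group are separated by single spaces.
    context = (
        f"// Context: {' '.join(parents)}"
        f" | Functions: {' '.join(functions)}"
        f" | Constants: {' '.join(constants)}"
    )
    # The declaration header must end with the opening brace.
    if not declaration.rstrip().endswith("{"):
        declaration = declaration.rstrip() + " {"
    return "\n".join([header, context, declaration])


prompt = build_prompt(
    version="^0.5.7",
    parents=["ParentA"],
    functions=["helloA", "helloB"],
    constants=["constantA"],
    declaration="contract HelloWorld is ParentA",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written `text` value.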
- Base T5 code model: https://huggingface.co/Salesforce/codet5-large
- Source data: https://huggingface.co/datasets/mwritescode/slither-audited-smart-contracts
- Processing steps: clean, contract-level segmentation and separation, split into input and output
- Input sample after processing:

      pragma solidity 0.5.7;
      // Context: PauserRole | Functions: isPauser addPauser renouncePauser | Constants:
      contract Pausable is PauserRole {

- Output sample after processing (note that the indentation is bad; this is intentional, to reduce token size):

      event Paused(address account); event Unpaused(address account); bool private _pausableActive; bool private _paused; constructor () internal { _paused = false; } function paused() public view returns (bool) { return _paused; } modifier whenNotPaused() { require(!_paused); _; } modifier whenPaused() { require(_paused); _; } function pause() public onlyPauser whenNotPaused whenPausableActive { _paused = true; emit Paused(msg.sender); } function unpause() public onlyPauser whenPaused whenPausableActive { _paused = false; emit Unpaused(msg.sender); } function _setPausableActive(bool _active) internal { _pausableActive = _active; } modifier whenPausableActive() { require(_pausableActive); _; } }

- Source training code: see the end-to-end notebook in the code directory
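The processing steps (clean, segment per contract, split into input and output) could look roughly like the sketch below. This is not the notebook's actual code: the regular expressions and the function names `strip_comments`, `flatten`, and `split_contract` are assumptions made for illustration. The sketch handles a single contract, omits the `pragma` and `// Context:` lines of the real input, and its naive comment regex does not account for comment markers inside string literals.

```python
import re


def strip_comments(source):
    """Remove /* ... */ and // ... comments (naive: ignores string literals)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def flatten(source):
    """Drop indentation and blank lines to reduce the token count."""
    lines = (line.strip() for line in source.splitlines())
    return "\n".join(line for line in lines if line)


def split_contract(source):
    """Split one contract at its first '{': header is the input, body the output."""
    cleaned = flatten(strip_comments(source))
    head, brace, body = cleaned.partition("{")
    return head + brace, body.strip()


sample = """
// SPDX comment
contract Pausable is PauserRole {
    /* state */
    bool private _paused;

    function paused() public view returns (bool) {
        return _paused; // current state
    }
}
"""
model_input, model_output = split_contract(sample)
print(model_input)
print(model_output)
```

Splitting at the first `{` keeps the declaration header on the input side, which matches the samples shown in this section, where the input ends with `{` and the output ends with the closing `}`.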
- The model is significantly under-trained for lack of GPU budget; a full training run needs about 10x the Colab resources (~$100)
- The ways the model can be used are quite limited. We could switch to a decoder-only model such as GPT-2 for comparison, though CodeT5 has the advantage of being strongly optimized for code
- Need more classifiers (T5- or BERT-like) to detect potential defects.
  - This is the intention of the project; however, I found it quite challenging to find labeled data that points to the exact line of code containing a defect
- Technically,
- When I tested a few examples with this significantly under-trained model and compared the results with ChatGPT, the two seemed to perform similarly on code completion. This suggests the model has the potential to surpass ChatGPT given a 30x training budget (and a reliable training pipeline)
  - Why 30x? I trained for 3 hours, which only finished 10% of 1 epoch. Based on my personal experience, 3 epochs is a reasonable size for fine-tuning (about 90 hours in total, i.e. 30x the 3 hours spent)
- The data is specially tuned for CodeT5, so it has these limitations:
  - It is split into input and output
  - The input/output token size limit is 512
- Ideally we could change the transformed data so that it could be trained BERT-style, to compare with T5-based classifiers, or in fill-in-the-missing-span style, adding context before and after so that the model outputs the code in the middle
- Training more classifiers (T5- or BERT-like) to detect potential defects is the intention of the project; however, I found it quite challenging to find labeled data that points to the exact line of code containing a defect
- The part where I divide code into segments with their ancestor info could be useful, but I need more time to evaluate it
- It is also hard to tell whether my aggressive approach of removing all comments (and thus relying on the meaning of the code alone) is a good one
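The 30x training budget estimate mentioned earlier follows from simple arithmetic on the figures given: 3 hours covered about 10% of one epoch, and 3 epochs is the target. A small sketch of that calculation, with illustrative variable names:

```python
hours_trained = 3    # training time actually spent
percent_done = 10    # share of one epoch completed in that time, in percent
target_epochs = 3    # epochs assumed to be reasonable for fine-tuning

hours_per_epoch = hours_trained * 100 / percent_done
total_hours = hours_per_epoch * target_epochs
budget_multiplier = total_hours / hours_trained

print(f"{hours_per_epoch:.0f} h per epoch, {total_hours:.0f} h total, "
      f"{budget_multiplier:.0f}x the time spent so far")
```

Under these figures, one epoch takes about 30 hours and three epochs about 90 hours, which is 30 times the 3 hours already spent.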