This repo provides code to fine-tune LLaMA-2-7B for the code summarization task on the CodeSearchNet dataset, by applying 4-bit quantization to the model weights and using QLoRA to train low-rank adapter weights.
In my experiments the model was trained on an NVIDIA A100 (40 GB) GPU on the Google Colab platform.
Note: The model was fine-tuned on the `python` subset of the dataset.
- pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
- pip install datasets
- pip install bitsandbytes
- pip install einops
- pip install huggingface_hub
- pip install pytorch-ignite
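Once the dependencies are installed, the 4-bit loading and QLoRA setup typically look like the sketch below, using `transformers`, `bitsandbytes` and `peft`. The LoRA hyper-parameters (rank, alpha, target modules) here are illustrative placeholders, not necessarily the repo's exact values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # gated model; requires approved HF access

# Load the base model with 4-bit NF4 quantization (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; the quantized base weights stay frozen
lora_config = LoraConfig(  # illustrative values, not the repo's exact config
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 has no pad token by default
```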
The dataset used is CodeSearchNet, which comprises 6 million functions from open-source code. It covers six languages: `python`, `java`, `javascript`, `php`, `ruby` and `go`. Furthermore, each language is split into three parts: `train`, `valid` and `test`. The dataset is distributed in the `JSONL` format.
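If you want to inspect the raw splits, the `datasets` library can read the JSONL files directly. The paths below assume the Cleaned_CodeSearchNet layout obtained in the next step:

```python
from datasets import load_dataset

# Paths assume the Cleaned_CodeSearchNet archive extracted under data/code2nl
data_files = {
    "train": "data/code2nl/CodeSearchNet/python/train.jsonl",
    "valid": "data/code2nl/CodeSearchNet/python/valid.jsonl",
    "test": "data/code2nl/CodeSearchNet/python/test.jsonl",
}
dataset = load_dataset("json", data_files=data_files)
print(dataset["train"][0].keys())  # inspect which fields each record carries
```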
You can download the dataset from the Google Drive link, or use the following commands:
```bash
pip install gdown
mkdir data data/code2nl
cd data/code2nl
gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..
```
The dataset is pre-processed using the `data_preprocessing.ipynb` notebook. It reads the original dataset from the `jsonl` files, constructs prompts for the model to train on, and saves the data, along with the prompts, as a `csv` file. The `text` column (the prompts) in the `csv` file is then used to train the model.
We construct prompts because LLaMA-2 is a decoder-only architecture: it is fine-tuned on a continuous stream of text that updates the model weights. Unlike an encoder-decoder architecture, we cannot feed the source code as the input and the target code summary as the output, so both are concatenated into a single prompt.
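A minimal sketch of that prompt construction is shown below. It assumes the JSONL records expose `code` and `docstring` fields and uses an illustrative instruction template; field names and template wording may differ from the actual notebook.

```python
import pandas as pd

PROMPT_TEMPLATE = (  # illustrative instruction format, not the repo's exact template
    "### Instruction:\nSummarize the following Python code.\n\n"
    "### Code:\n{code}\n\n"
    "### Summary:\n{summary}"
)

def build_prompts(jsonl_path: str, csv_path: str) -> None:
    """Read a CodeSearchNet split and save it as a CSV with a 'text' prompt column."""
    df = pd.read_json(jsonl_path, lines=True)
    df["text"] = [
        PROMPT_TEMPLATE.format(code=code, summary=summary)
        for code, summary in zip(df["code"], df["docstring"])
    ]
    df[["text"]].to_csv(csv_path, index=False)

build_prompts("data/code2nl/CodeSearchNet/python/train.jsonl", "train.csv")
```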
To train the model with the hyper-parameters used in the code, run the cells on an instance with an A100 GPU. The model will be saved in the `results` directory.
Before running the cells, make sure to upload the dataset from the pre-processing step, in `csv` format, to a HuggingFace dataset repository, and change the `dataset_name` variable to that dataset's name on HuggingFace.
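For reference, a training run with `trl`'s SFTTrainer typically looks like the sketch below. The hyper-parameter values are placeholders, and `trl` versions differ in where some arguments live (e.g. `dataset_text_field` has moved between the trainer and its config across releases).

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset_name = "your-username/codesearchnet-python-prompts"  # placeholder HF dataset id
train_dataset = load_dataset(dataset_name, split="train")

training_arguments = TrainingArguments(
    output_dir="./results",          # checkpoints land here, as noted above
    per_device_train_batch_size=4,   # placeholder values; tune for your GPU
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=25,
    fp16=True,
)

trainer = SFTTrainer(
    model=model,                     # the 4-bit + LoRA model prepared earlier
    train_dataset=train_dataset,
    dataset_text_field="text",       # the prompt column from the CSV
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)
trainer.train()
trainer.model.save_pretrained("./results")
```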
At the notebook's end, the model is uploaded to the HuggingFace model hub, where it can be downloaded for inference.
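Pushing the trained adapter and tokenizer to the Hub takes only a few lines (authenticate first with a write-scoped token); the repo id below is a placeholder:

```python
from huggingface_hub import notebook_login

notebook_login()  # paste a HuggingFace token with write access
trainer.model.push_to_hub("your-username/llama2-7b-code-summarization")  # placeholder repo id
tokenizer.push_to_hub("your-username/llama2-7b-code-summarization")
```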
The model hyper-parameters can be adjusted to experiment with various configurations, depending on the available hardware.
You can fine-tune other models, such as Mistral-7B, by modifying the `model_name` variable in the code. Choose any causal language model from the HuggingFace model hub, and it will be downloaded and fine-tuned on your dataset.
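For example, switching to Mistral-7B only requires pointing `model_name` at the corresponding Hub id (and accepting any gating terms on the model page):

```python
model_name = "mistralai/Mistral-7B-v0.1"  # any causal LM on the HuggingFace Hub works here
```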