Code Llama Fine-tune

This repository houses code for fine-tuning the Code Llama model (based on Llama 2) on the task of predicting method names: given a code snippet in a language such as Python or Java with the method name masked out, the model is trained to produce an appropriate name as its output.
This task is a form of code infilling, in which the model is conditioned to generate code that fits between the beginning and end of an existing snippet. The model is fine-tuned on data scraped from the official IntelliJ Community repository.
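
To illustrate what infilling looks like in practice, here is a minimal sketch using the Hugging Face transformers API and the codellama/CodeLlama-7b-hf checkpoint. The checkpoint and prompt are illustrative; this is not necessarily how this repository's code builds its prompts.

# Minimal infilling sketch: the method name is masked with <FILL_ME>,
# and the model generates the missing middle.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

prompt = "def <FILL_ME>(numbers):\n    return sum(numbers) / len(numbers)\n"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
generated = model.generate(input_ids, max_new_tokens=8)

# Decode only the newly generated tokens, i.e. the predicted method name.
filling = tokenizer.decode(generated[0][input_ids.shape[1]:], skip_special_tokens=True)
print(prompt.replace("<FILL_ME>", filling))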

Objective and Future Directions

The main objective of this repository is to showcase the adaptation of a downstream model and to validate the feasibility of the task through its outcomes. Potential future improvements include saving and reloading fine-tuned models, supporting additional programming languages, and refining the code and methodology to handle a larger dataset.

Module Development Specifications

This module was developed with Python 3.9 and CUDA 11.7 and requires GPU access. Other combinations may work, but they haven't been verified through testing.
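
To confirm that GPU access is available, a quick sanity check from Python (assuming PyTorch is among the installed dependencies):

import torch

print(torch.cuda.is_available())  # should print True on a correctly configured machine
print(torch.version.cuda)         # CUDA version PyTorch was built against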


Setup

Steps to set up the module.

  1. Clone the intellij-community repository:
git clone https://github.com/JetBrains/intellij-community
  2. Clone this repository:
git clone https://github.com/KosPsych/codellama-finetune
  3. Add the path to a dataset folder and to the intellij-community repository in constants.py (see the sketch after this list).
  4. Create a virtual environment:
conda create --name <name> python=3.9
  5. Activate the virtual environment:
conda activate <name>
  6. Install dependencies:
pip install -r requirements.txt
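
For step 3, constants.py might look roughly like the sketch below. The variable names here are assumptions; match them to the actual file.

# Hypothetical contents of constants.py; adjust the names to the real module.
DATASET_PATH = "/path/to/dataset"              # folder where scraped methods are saved
INTELLIJ_PATH = "/path/to/intellij-community"  # the cloned IntelliJ Community repo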

Running

If you've followed the steps above, the module is now installed and ready to use.

Usage 1

Use this module to retrieve methods from the repository. It parses the repository, captures Java and Python methods, and saves them in a text file, separating them with the delimiter "###END###". To start this process, run the following command:

python3 src/dataset_creation.py n_files

Here, n_files represents the number of files you wish to process from the specified repository. For instance, if n_files is set to 100, the module will scrape 100 Java files and 100 Python files.

Before running this process, make sure the dataset folder is empty; remove any files left over from previous runs.
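
Once the scrape finishes, the dataset can be split back into individual methods using the "###END###" delimiter. A minimal sketch (the file name below is an assumption; check the dataset folder for the actual name):

# Read the scraped methods back; the file name is hypothetical.
with open("dataset/java_methods.txt") as f:
    methods = [m.strip() for m in f.read().split("###END###") if m.strip()]
print(f"Loaded {len(methods)} methods")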

Usage 2

Using the dataset created by python3 src/dataset_creation.py n_files, you can fine-tune Code Llama to predict method names with:

python3 src/experiments.py lang

Here, lang is the programming language to use: either java or python.
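For example, to run the full evaluation and fine-tuning cycle on the Java methods:

python3 src/experiments.py java
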
This code first evaluates the base model on the dataset, reporting accuracy and similarity (measured via Levenshtein distance) for the method-name prediction task. The model is then fine-tuned using the peft library, and the fine-tuned model is re-evaluated on the dataset to track the improvement in both metrics. The results are printed to the terminal.
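
For reference, PEFT-style fine-tuning attaches small trainable adapters (e.g. LoRA) to a frozen base model, so only a tiny fraction of the weights are updated. A minimal sketch with the peft library; the checkpoint and hyperparameters are illustrative, not the exact configuration used in src/experiments.py:

# LoRA adapters via peft: the base model stays frozen, adapters are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of all parameters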