SWE-bench-COMPSCI685

Repo for COMPSCI685 final project

Model Eval - COMPSCI685

Model Eval focuses on the fine-tuning, evaluation, and analysis of Large Language Models (LLMs) for the task of generating SQL queries from natural-language questions. This is a subset of the broader problem of code generation with LLMs, where the input is a natural-language description of the task to be performed and the output is syntactically correct, executable code.
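
For illustration, a single instance of this task pairs a natural-language question with a target SQL query. The example below is written in the style of a Spider instance and is not drawn from the project's data:

```python
# Hypothetical text-to-SQL instance in the style of Spider (illustration only).
example = {
    "db_id": "concert_singer",                    # database the query runs against
    "question": "How many singers do we have?",   # natural-language input to the LLM
    "query": "SELECT count(*) FROM singer",       # SQL the model should generate
}
```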

The Model Eval implementation focuses on two different tasks:

  • Fine-Tuning of Pre-Trained Models on the Spider training set
  • Evaluation of the Generated Queries

Dataset:

The Spider dataset was used for both fine-tuning and evaluation - Spider.
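
A minimal sketch of loading Spider is shown below, assuming the Hugging Face `datasets` library and the `spider` dataset id on the Hub; the notebooks may load the data differently.

```python
# Minimal sketch of loading the Spider dataset. The "spider" dataset id on the
# Hugging Face Hub is an assumption; see the notebooks for the actual loading code.
from datasets import load_dataset

spider = load_dataset("spider")           # splits: "train" (~7000 instances) and "validation"
print(spider["train"][0]["question"])     # natural-language question
print(spider["train"][0]["query"])        # gold SQL query
```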

Large Language Models

Pre-trained Gemma 7B, fine-tuned Gemma 7B, fine-tuned Gemma 2B, and SWE-LLama were used for comparison.

Fine-tuning:

Gemma 2B and Gemma 7B were fine-tuned on the Spider training set of 7,000 instances.

To run the fine-tuning pipeline for Gemma 7B, follow along with the Spider_Gemma7b_Finetuning.ipynb notebook. To run the fine-tuning pipeline for Gemma 2B, follow along with the Spider_Gemma2b_Finetuning.ipynb notebook.

  • A T4 GPU was used to fine-tune both models
  • The data was loaded from Hugging Face and the models were loaded from Unsloth (a rough sketch of this setup is shown after this list)
  • Our fine-tuned model can be accessed here:
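
The sketch below outlines a LoRA fine-tune of Gemma with Unsloth and TRL on the Spider train split. The model id, prompt template, and hyperparameters are illustrative assumptions, not the project's configuration; the actual setup lives in the fine-tuning notebooks.

```python
# Minimal LoRA fine-tuning sketch with Unsloth + TRL on the Spider train split.
# Model id, prompt template, and hyperparameters are assumptions; see
# Spider_Gemma7b_Finetuning.ipynb / Spider_Gemma2b_Finetuning.ipynb for the real setup.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

max_seq_length = 1024
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-7b-bnb-4bit",   # 4-bit base model (assumed id); fits on a T4
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(      # attach LoRA adapters
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

def to_text(example):
    # Hypothetical prompt template: question in, SQL out.
    return {"text": f"### Question:\n{example['question']}\n### SQL:\n{example['query']}"
                    + tokenizer.eos_token}

train_data = load_dataset("spider", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        output_dir="gemma7b-spider-checkpoint",   # hypothetical output folder
    ),
)
trainer.train()
```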

Inference:

Batch inference can be performed on Gemma 2B and 7B. Model-loading code is provided for both the fine-tuned models and the baseline models. To use the fine-tuned models, access and copy the checkpoint folders referenced above. Before running the notebook, comment out the models that are not being used and set valid output paths.
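
As a rough sketch (not the notebook's code), loading a fine-tuned checkpoint and generating queries for a small batch might look like the following; the checkpoint folder name and prompt format are assumptions.

```python
# Minimal batch-inference sketch. The checkpoint folder name and prompt format
# are assumptions; copy the actual checkpoint folders referenced above.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="gemma7b-spider-checkpoint",   # hypothetical local checkpoint folder
    max_seq_length=1024,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)        # enable Unsloth's faster generation mode

questions = ["How many singers do we have?", "List the names of all departments."]
prompts = [f"### Question:\n{q}\n### SQL:\n" for q in questions]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```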

Evaluation:

Three metrics were used to evaluate the generated queries: Partial Matching, Exact Matching, and ROUGE score. The metrics are computed for the entire dataset and for subsets of the dataset obtained by splitting the instances on query complexity and query length. The code for the evaluation process can be found here - Evaluation Scripts
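
For a rough idea of what these metrics compute, the snippet below shows a naive normalized exact match and ROUGE via the `evaluate` library. It is a simplified illustration, not the project's implementation; the actual Partial/Exact Matching logic is in the Evaluation scripts.

```python
# Simplified illustration of exact match and ROUGE between generated and gold SQL.
import evaluate

predictions = ["SELECT count(*) FROM singer"]
references  = ["SELECT COUNT(*) FROM singer"]

def normalize(sql: str) -> str:
    # Naive normalization: lowercase and collapse whitespace.
    return " ".join(sql.lower().split())

exact_match = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references)) / len(predictions)

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)

print("Exact match:", exact_match)
print("ROUGE-L:", scores["rougeL"])
```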

Directory Structure:

  1. Data - Manually annotated data categorized by question complexity
  2. Fine-Tuning - Code related to Gemma 2B and 7B fine-tuning
  3. Inference - Code for generating outputs with Gemma 2B, Gemma 7B, fine-tuned Gemma 7B, fine-tuned Gemma 2B, and SWE-LLama
  4. Evaluation - Code for the evaluation metrics (Partial Matching, Exact Matching, ROUGE score) for Gemma 7B, fine-tuned Gemma 7B, fine-tuned Gemma 2B, and SWE-LLama
  5. Output - Generated queries after inference