Research Project for Summer Undergraduate Research Fellowship (SURF) 2023

SciPhy-RAG

  • This project was awarded the Summer Undergraduate Research Fellowship 2023 and was carried out under the guidance of Dr. Rajiv Ratn Shah from MIDAS-IIITD.
  • It aims to apply techniques from controllable scientific text generation to High School Physics.
  • The project stems from the hypothesis that Physics Word Problems (PWPs) require an understanding of concepts grounded in physics formulae, and that solving them is thus a fundamentally different task from solving Math Word Problems (MWPs).

Data Collection and Augmentation:

  • Topics are collected from Indian High School Physics textbooks, along with questions from datasets such as SCIMAT (Kollepara et al. 2021), which contain inconsistencies that we fix.
  • Linear transformations are applied to the questions to enlarge the dataset, based on the idea that linearly transformed questions help the language model better understand the underlying concept.
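
As an illustration of the augmentation idea above, here is a minimal sketch of one possible reading of "linear transformation": every numeric value in a question is multiplied by a constant factor. The helper names `scale_numbers` and `augment` and the choice of factors are assumptions made for this example and are not taken from the released code. The answer to each variant would still need to be recomputed from the relevant physics formula.

```python
import re

def scale_numbers(question, factor):
    """Multiply every numeric literal in `question` by `factor` (hypothetical helper)."""
    def _scale(match):
        value = float(match.group()) * factor
        # The "g" format drops trailing zeros, so 240.0 is written as "240".
        return f"{value:g}"
    return re.sub(r"\d+(?:\.\d+)?", _scale, question)

def augment(question, factors=(2, 3, 0.5)):
    """Return one linearly transformed variant of the question per factor."""
    return [scale_numbers(question, f) for f in factors]

question = "A car travels 120 m in 10 s. What is its average speed?"
variants = augment(question)
for v in variants:
    print(v)
```

Scaling all quantities by the same factor keeps the wording and the underlying formula fixed while changing the numbers, which is the property the augmentation relies on.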

Fine-Tuning Vicuna using LoRA

  • Vicuna is a strong open-source chat model (fine-tuned from LLaMA), and fine-tuning it further can improve results on specific applications. This document gives an overview of how we fine-tuned Vicuna using LoRA in both 8-bit and 16-bit precision.

  • Low-Rank Adaptation, or LoRA (Hu et al. 2021), is a method for efficiently fine-tuning large neural networks: the pretrained weights are kept frozen and the update to each adapted weight matrix is expressed as a product of two lower-rank matrices. Because only this small set of parameters is trained, updates are quicker and cheaper, which is especially useful when the infrastructure available for fine-tuning is limited.

  • The rank of the LoRA matrices is adjusted separately for the 8-bit and 16-bit settings.

  • We refer to the following repository for helping us fine-tune using LoRA: Link

  • We use our hand-annotated dataset of 9.5K physics questions. We split it into training and test sets and fine-tune the model on the training set using supervised fine-tuning.

  • We then run inference on the test set to evaluate the fine-tuned model.
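
The low-rank update behind LoRA, described above, can be sketched in a few lines of NumPy. This is a toy illustration of the formulation in Hu et al. (2021), not the fine-tuning code used for Vicuna; the dimensions, rank `r`, and scaling `alpha` are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8  # illustrative sizes, not the project's settings

W = rng.normal(size=(d_out, d_in))          # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so the update starts at zero

def lora_forward(x):
    """Compute x W^T plus the scaled low-rank update x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
full_params = W.size           # parameters touched by full fine-tuning of this layer
lora_params = A.size + B.size  # parameters trained by LoRA for this layer
print(full_params, lora_params)
```

With these toy sizes LoRA trains 512 parameters for the layer instead of 4096, and because `B` starts at zero the adapted layer initially reproduces the frozen model's output exactly.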

Retrieval Experiment Setup (SciPhy-RAG):

  • Wikipedia articles are retrieved using similarity search between the physics sub-topics and the titles of Wikipedia pages.
  • These are stored as embeddings in a vector database (e.g. Pinecone).
[Image: screenshot, 2023-09-21]
  • At inference time, the question is sent to the vector database, where Approximate Nearest Neighbor (ANN) search finds the N passages most relevant to solving the question.
  • The question and the N retrieved passages are then passed together as the input prompt to the language model. Inference is run on the test set again to obtain the results.
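
The retrieval steps above can be sketched end to end as follows. This is a simplified stand-in under stated assumptions: it uses a toy bag-of-words embedding and exact cosine-similarity search in NumPy rather than a learned embedding model and ANN search in a vector database such as Pinecone. The passages, prompt template, and all function names are hypothetical and are not the released pipeline.

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(texts):
    """Map each distinct token in the passages to a vector index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def embed(text, vocab):
    """Toy embedding: L2-normalised bag-of-words counts over the passage vocabulary."""
    vec = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def top_n(question, passages, index, vocab, n=2):
    """Exact cosine-similarity search (a vector database would use ANN here)."""
    scores = index @ embed(question, vocab)
    order = np.argsort(-scores)[:n]
    return [passages[i] for i in order]

def build_prompt(question, retrieved):
    context = "\n".join(f"- {p}" for p in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

passages = [
    "Newton's second law states that force equals mass times acceleration.",
    "Ohm's law relates voltage, current and resistance in a circuit.",
    "Kinetic energy of a body equals half its mass times velocity squared.",
]
vocab = build_vocab(passages)
index = np.stack([embed(p, vocab) for p in passages])  # stands in for the vector store

question = "What force produces an acceleration of 3 m/s^2 on a 2 kg mass?"
retrieved = top_n(question, passages, index, vocab, n=1)
prompt = build_prompt(question, retrieved)
print(prompt)
```

Swapping the exact search for an ANN index changes only `top_n`; the prompt construction and the call to the language model stay the same.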

Final Results:

[Image: screenshot of the final results, 2023-09-21]
  • We release our data augmentation code for generating the dataset, along with the training and test questions used.
  • We additionally release the code for the retrieval pipeline.