OpenAssistant-Guanaco-NLP-Project

Repository for the OpenAssistant-Guanaco project of the Natural Language Processing course.

Primary language: Jupyter Notebook

Project Description

This repository contains the project work for the "Natural Language Processing" (NLP) course. The main objective of this project was to fine-tune a Large Language Model (LLM) into a chatbot using the OpenAssistant-Guanaco dataset.

Team Members

  • Balice Matteo
  • Doronzo Antonio Giuseppe
  • Fabris Filip
  • Masini Alessandro

Tasks Overview

1. Data Analysis

  • Dataset Analysis:
    • Calculated and visualized basic statistics such as average document length and average vocabulary size.
  • Word2Vec Embedding:
    • Trained a Word2Vec embedding on the data and analyzed its properties.
  • Document Clustering:
    • Clustered the documents and visualized the clusters to check whether they correspond to recognizable groups or known classes.
    • Indexed the documents to support keyword search.
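
The statistics and keyword-search steps above can be sketched in plain Python. This is a minimal illustration, not the notebook code from this repository: the tokenizer, the function names (`corpus_stats`, `build_index`, `search`), and the reading of "average vocabulary size" as the mean number of distinct tokens per document are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: a simple lowercase word tokenizer; the project may use a different one.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def corpus_stats(docs):
    """Compute basic corpus statistics over a list of document strings."""
    if not docs:
        raise ValueError("docs must contain at least one document")
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    return {
        # Mean number of tokens per document.
        "avg_doc_length": sum(len(t) for t in tokenized) / n,
        # Mean number of distinct tokens per document (assumed definition).
        "avg_vocab_size": sum(len(set(t)) for t in tokenized) / n,
        # Number of distinct tokens in the whole corpus.
        "total_vocab_size": len(set().union(*tokenized)),
    }


def build_index(docs):
    """Build an inverted index mapping each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in tokenize(doc):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)
```

For example, `search(build_index(docs), "sort list")` returns the ids of the documents that contain both "sort" and "list". Requiring every query term (an AND query) is a design choice for this sketch; ranking by term frequency would be a natural next step.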

2. Model Training

  • Fine-Tuning:
    • Fine-tuned three different models: Meta Llama 2 7B, Meta Llama 3 8B, and Microsoft Phi-3-mini-4k-instruct.
  • Evaluation:
    • Evaluated and compared the performance of the fine-tuned models against their original versions.
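
This README does not state which metric was used for the comparison. As one possible illustration, the sketch below scores model outputs against reference answers with token-level F1, the overlap measure used in SQuAD-style evaluation, and reports the mean score for the original and the fine-tuned model. The function names and the `base_outputs` / `tuned_outputs` inputs are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    """Average token F1 over paired predictions and references."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


def compare(base_outputs, tuned_outputs, references):
    """Score the original and the fine-tuned model on the same references."""
    return {
        "base": mean_f1(base_outputs, references),
        "fine_tuned": mean_f1(tuned_outputs, references),
    }
```

Scoring both models on the same prompts and references keeps the comparison fair; for open-ended chat responses, an overlap metric like this is only a rough proxy and would normally be complemented by other measures.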

3. Extensions

  • Cross-Dataset Evaluation:
    • Investigated model performance on the Stanford Question Answering Dataset (SQuAD).
  • Gradio Application:
    • Developed a Gradio application for interacting with the Meta Llama 3 8B model fine-tuned on the Guanaco dataset.
    • The application features:
      • A text box for user input.
      • A text box displaying the system response generated by the model.
      • A submit button to trigger the system response.
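
An interface with these three components could be laid out roughly as follows using Gradio's `Blocks` API. This is a sketch under stated assumptions rather than the application shipped in this repository: `generate` stands in for whatever function wraps the fine-tuned model, and the helper names, labels, and empty-input message are hypothetical.

```python
def make_respond(generate):
    """Wrap a text-generation callable so it can serve as the button handler.

    `generate` is a hypothetical function that takes the user's text and
    returns the model's reply as a string.
    """
    def respond(user_input):
        user_input = user_input.strip()
        if not user_input:
            # Assumed behavior: avoid calling the model on empty input.
            return "Please enter a message."
        return generate(user_input)

    return respond


def build_app(generate):
    """Assemble the interface: input box, response box, and submit button."""
    # Imported here so make_respond can be used and tested without Gradio installed.
    import gradio as gr

    with gr.Blocks() as demo:
        user_box = gr.Textbox(label="User input", lines=3)
        response_box = gr.Textbox(label="System response", interactive=False)
        submit = gr.Button("Submit")
        # Clicking the button sends the input text to the handler
        # and writes the returned string into the response box.
        submit.click(fn=make_respond(generate), inputs=user_box, outputs=response_box)
    return demo
```

Calling `build_app(generate).launch()` starts a local web server for the interface. Keeping the model call behind a plain function makes the handler easy to test with a stub before loading the actual model.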