This repository contains the project work for the "Natural Language Processing" (NLP) course. The main objective of this project was to fine-tune Large Language Models (LLMs) into chatbots using the OpenAssistant-Guanaco dataset.
Team members:

- Balice Matteo
- Doronzo Antonio Giuseppe
- Fabris Filip
- Masini Alessandro
The project comprised the following steps:

- Dataset Analysis:
- Calculated and visualized basic corpus statistics such as average document length and vocabulary size.
- Word2Vec Embedding:
- Trained a Word2Vec embedding on the data and analyzed its properties.
- Document Clustering:
- Clustered the documents and visualized the clusters to identify groups or known classes.
- Document Indexing:
- Indexed the documents to support keyword search.
- Fine-Tuning:
- Fine-tuned three different models: Meta Llama 2 7B, Meta Llama 3 8B, and Microsoft Phi-3-mini-4k-instruct.
- Evaluation:
- Evaluated and compared the performance of the fine-tuned models against their original versions.
- Cross-Dataset Evaluation:
- Investigated model performance on the Stanford Question Answering Dataset (SQuAD).
- Gradio Application:
- Developed a Gradio application for interacting with the Meta Llama 3 8B model fine-tuned on the Guanaco dataset.
- The application features:
- A text box for user input.
- A text box displaying the system response generated by the model.
- A submit button that sends the input and triggers the model response.
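The dataset-analysis step above can be sketched in plain Python (the toy `corpus` below is illustrative; the project used OpenAssistant-Guanaco):

```python
def corpus_stats(corpus):
    """Return average document length (in whitespace tokens) and vocabulary size."""
    tokenized = [doc.lower().split() for doc in corpus]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    vocab = {tok for toks in tokenized for tok in toks}
    return avg_len, len(vocab)

corpus = [
    "Human: What is NLP? Assistant: Natural language processing.",
    "Human: Explain word embeddings. Assistant: Dense vector representations of words.",
]
avg_len, vocab_size = corpus_stats(corpus)
```

In the real project the same statistics would be computed over the full dataset and plotted (e.g. a histogram of document lengths).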
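The Word2Vec step can be sketched with `gensim` (sentences and hyperparameters here are illustrative, not the project's exact configuration):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; the project trained on the Guanaco documents.
sentences = [
    ["natural", "language", "processing", "with", "neural", "networks"],
    ["word", "embeddings", "capture", "semantic", "similarity"],
    ["neural", "networks", "learn", "word", "embeddings"],
]

# Skip-gram (sg=1) Word2Vec; vector_size/window/epochs are placeholder values.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=20, seed=0)

vec = model.wv["word"]                       # embedding vector for a token
similar = model.wv.most_similar("word", topn=2)  # nearest neighbours in embedding space
```

Analyzing the embedding's properties then typically means inspecting nearest neighbours and similarity scores as above.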
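The clustering-and-visualization step can be sketched with scikit-learn; TF-IDF features, KMeans, and a 2-D PCA projection for plotting (documents and cluster count are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

docs = [
    "llama models generate text",
    "language models generate fluent text",
    "clustering groups similar documents",
    "kmeans clustering groups documents",
]

# TF-IDF vectors, then KMeans with an assumed k=2.
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project to 2-D for a scatter plot coloured by cluster label.
coords = PCA(n_components=2).fit_transform(X.toarray())
```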
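The keyword-search indexing step can be sketched as a simple inverted index (the project may well have used a library such as Whoosh or Elasticsearch; this pure-Python version just illustrates the idea):

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token (AND semantics)."""
    token_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

docs = ["llama model fine tuning", "gradio web interface", "llama chat interface"]
index = build_index(docs)
```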
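The fine-tuning step is typically done parameter-efficiently on models of this size; a hedged configuration sketch using Hugging Face `transformers` and `peft` (the LoRA hyperparameters, target modules, and training recipe are assumptions, not the project's exact setup, and API details vary across library versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # gated model; requires Hugging Face access

# Illustrative LoRA hyperparameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train

# Training would then run with a supervised fine-tuning trainer
# over the OpenAssistant-Guanaco conversations.
```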
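Comparing a fine-tuned model against its base version is often done via perplexity on held-out text; a minimal sketch of that comparison (the log-probability values below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs from the base vs. fine-tuned model
# on the same held-out sequence; lower perplexity = better fit.
base_logprobs = [-2.3, -1.9, -2.8, -2.1]
tuned_logprobs = [-1.2, -0.9, -1.5, -1.1]
```

In practice the log-probabilities come from scoring the evaluation set with each model; the project may also have used generation-quality metrics alongside perplexity.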
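SQuAD evaluation is conventionally scored with exact match and token-level F1 after answer normalization; a sketch following the standard normalization (lowercase, strip punctuation and articles, collapse whitespace):

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```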
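The Gradio application described above (input text box, response text box, submit button) maps directly onto `gr.Interface`; a hedged sketch where `generate_reply` is a hypothetical stub standing in for the fine-tuned Llama 3 8B generation call:

```python
def generate_reply(prompt: str) -> str:
    # Stub: in the project this would run generation with the
    # fine-tuned Meta Llama 3 8B model instead of echoing.
    return f"[model response to: {prompt}]"

def build_app():
    import gradio as gr  # imported lazily so the stub runs without gradio installed

    # gr.Interface wires the three described widgets together;
    # the submit button is provided by Interface itself.
    return gr.Interface(
        fn=generate_reply,
        inputs=gr.Textbox(label="Your message"),
        outputs=gr.Textbox(label="System response"),
        title="Guanaco-tuned Llama 3 chatbot",
    )

# build_app().launch()  # uncomment to serve the UI locally
```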