/MD-Trabalho

Data Mining - Practical Assignment (23/24)


LLM Classifier

About the Project

This project was developed as part of the Data Mining course at the University of Minho. The main objective is to determine if a text was written by a Large Language Model (LLM) and, if so, identify which LLM was used.

To achieve this goal, the interface allows users to submit a text, which is then analyzed by a model trained to recognize texts generated by the following LLMs:

  • GPT-4 (OpenAI)
  • Meta-Llama-3-8B-Instruct (Meta)
  • Phi-3-mini-4k-instruct (Microsoft)
  • Mixtral-8x7B-Instruct-v0.1 (Mistral AI)
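The classification flow above (submit a text, predict which of the four LLMs wrote it) could be sketched as follows. This is a hypothetical sketch only: the label order is an assumption, and `predict` stands in for the trained classifier (e.g. a Hugging Face `pipeline("text-classification", ...)` loaded from the checkpoint produced by the training notebook).

```python
from typing import Callable

# One label per candidate model; the order is an assumption, and the
# Llama/Phi names follow the repository's data folders.
LABELS = [
    "GPT-4",
    "Meta-Llama-3-8B-Instruct",
    "Phi-3-mini-4k-instruct",
    "Mixtral-8x7B-Instruct-v0.1",
]

def classify(text: str, predict: Callable[[str], int]) -> str:
    """Map a classifier's predicted label id to a model name.

    `predict` stands in for the trained model: given a text, it
    returns an integer class id in ``range(len(LABELS))``.
    """
    label_id = predict(text)
    if not 0 <= label_id < len(LABELS):
        raise ValueError(f"unexpected label id: {label_id}")
    return LABELS[label_id]

# Example with a stub predictor that always returns class 0:
print(classify("Sample answer text.", lambda _t: 0))  # GPT-4
```

In the web app, `predict` would wrap the fine-tuned model's inference call; the stub here only illustrates the label mapping.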

This project involves the integration of various technologies and data mining techniques to ensure accurate and efficient analysis of the submitted texts. Additionally, the interface has been designed with a user-friendly approach to provide an intuitive and effective user experience.

Repository Organization

  • app/: Contains the web app.

    • templates/: HTML of the web app.
    • interface.py: Code of the web app.
    • requirements.txt: List of dependencies needed to run the app.
  • docs/: Documents related to the practical assignment.

    • [MD] TP - Apresentação Final.pdf: Final presentation.
    • [MD] TP - Apresentação Inicial.pdf: Initial presentation.
    • artigofinal_grupo10.pdf: Article describing the work done on the practical assignment.
  • gcp/: Scripts used to run the Google Cloud VMs where the model was trained.

  • meta-llama/: Data related to the Meta-Llama-3-8B-Instruct responses.

    • Meta-Llama-3-8B-Instruct.zip: Zip containing the 30k responses of the Meta-Llama-3-8B-Instruct model.
    • Meta-Llama-3-8B-Instruct_valid.zip: Zip containing the valid responses of the Meta-Llama-3-8B-Instruct model (used in training).
    • Meta-Llama-3-8B-Instruct_invalid.zip: Zip containing the invalid responses of the Meta-Llama-3-8B-Instruct model.
  • microsoft/: Data related to the Phi-3-mini-4k-instruct responses.

    • Phi-3-mini-4k-instruct.zip: Zip containing the 30k responses of the Phi-3-mini-4k-instruct model.
    • Phi-3-mini-4k-instruct_valid.zip: Zip containing the valid responses of the Phi-3-mini-4k-instruct model (used in training).
    • Phi-3-mini-4k-instruct_invalid.zip: Zip containing the invalid responses of the Phi-3-mini-4k-instruct model.
  • notebook/: Contains the notebook where the model was trained.

    • question_classificator_bert_training.ipynb: Notebook used to train the model; it also contains the training results.
  • mistralai/: Data related to the Mixtral-8x7B-Instruct-v0.1 responses.

    • Mixtral-8x7B-Instruct-v0.1.zip: Zip containing the 30k responses of the Mixtral-8x7B-Instruct-v0.1 model.
    • Mixtral-8x7B-Instruct-v0.1_valid.zip: Zip containing the valid responses of the Mixtral-8x7B-Instruct-v0.1 model (used in training).
    • Mixtral-8x7B-Instruct-v0.1_invalid.zip: Zip containing the invalid responses of the Mixtral-8x7B-Instruct-v0.1 model.
  • openai/: Data related to the GPT-4 responses.

    • GPT4.zip: Zip containing the 30k responses of the GPT-4 model.
    • GPT4_valid.zip: Zip containing the valid responses of the GPT-4 model (used in training).
    • GPT4_invalid.zip: Zip containing the invalid responses of the GPT-4 model.
    • filtered_file.csv: Dataset containing the GPT-4 responses in the correct format.
    • train.csv: Original data file.
  • questions/: Data related to the questions asked to the models.

    • questions.json: JSON file with the 30k questions.
  • scripts/: Data pre-processing scripts.

    • GPT4_to_json.py: Converts the GPT-4 answers from CSV to JSON format.
    • clean_questions.py: Cleans the models' responses so that only clean, valid texts reach the training phase.
    • get_questions.py: Stores the questions.
    • script.ipynb: Simplifies the CSV containing the answers provided by GPT-4, keeping only the columns relevant to this project.
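A CSV-to-JSON conversion like the one GPT4_to_json.py performs could look like the sketch below. The column name "answer" and the output shape are assumptions; the actual script may use different fields.

```python
import csv
import io
import json

def csv_to_json(csv_text: str, column: str = "answer") -> str:
    """Extract one column from CSV text and serialize it as a JSON list."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([row[column] for row in rows], ensure_ascii=False)

# Minimal example: two answers in a two-column CSV.
sample = "id,answer\n1,Hello\n2,World\n"
print(csv_to_json(sample))  # ["Hello", "World"]
```

Reading whole files would work the same way, with `open(path, newline="")` in place of `io.StringIO`.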
  • .gitignore: Specifies files ignored by Git.

  • README.md: This file, providing an overview of the project and repository organization.

Working Group

  • João Paulo Machado Abreu, pg53928
  • João Pedro Dias Faria, pg53939
  • Ricardo Cardoso Sousa, pg54179