SE464 Machine Learning Project

Hate Speech Labeler

Streamlit app for hate speech detection using a fine-tuned BERT-based model. The model is trained on the Jigsaw Toxic Comment Classification Challenge dataset for multi-label classification.

  • Code and data are available in this notebook

  • The app is deployed and can be tested here (also available at this link)

  • The model is available on Hugging Face

Local Installation

  • Clone the repository:

    git clone https://github.com/berkaysahiin/SE464.git
  • Change into the directory:

    cd SE464
  • Create and activate a virtual environment:

    virtualenv venv
    # Windows:
    .\venv\Scripts\activate
    # Linux/macOS: source venv/bin/activate
  • Install the requirements:

    pip install -r requirements.txt
    # If this fails, try running this first: pip install pipreqs && pipreqs
    
  • Run the Streamlit app:

    streamlit run main.py
    

Model

  • Preprocessing cleans the text, tokenizes it, and formats the labels for multi-label classification.

  • The model is trained with TrainingArguments and Trainer from the Transformers library.

  • Metrics such as F1 score, ROC AUC, and accuracy are used to evaluate the model's performance on the test set.
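
The label formatting and evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code. The column names follow the Jigsaw train.csv file; the helper names (clean_text, row_to_example, multilabel_metrics), the specific cleaning steps, and the 0.5 decision threshold are assumptions made for the example. ROC AUC is left out to keep the sketch short.

```python
import math
import re

# The six label columns of the Jigsaw Toxic Comment dataset.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def clean_text(text):
    """Illustrative cleaning (assumed steps): collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def row_to_example(row):
    """Turn one CSV-style row into cleaned text plus a float multi-hot label vector."""
    return {
        "text": clean_text(row["comment_text"]),
        # Float labels, since multi-label training typically uses a
        # binary cross-entropy loss over independent per-label outputs.
        "labels": [float(row[name]) for name in LABELS],
    }


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_metrics(logits, labels, threshold=0.5):
    """Micro-averaged F1 and exact-match (subset) accuracy from raw logits."""
    tp = fp = fn = exact = 0
    for logit_row, label_row in zip(logits, labels):
        # Each label is decided independently: sigmoid, then threshold.
        preds = [1 if sigmoid(z) >= threshold else 0 for z in logit_row]
        golds = [int(y) for y in label_row]
        tp += sum(p == 1 and g == 1 for p, g in zip(preds, golds))
        fp += sum(p == 1 and g == 0 for p, g in zip(preds, golds))
        fn += sum(p == 0 and g == 1 for p, g in zip(preds, golds))
        # A sample counts as correct only if every label matches.
        exact += preds == golds
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"f1": f1, "accuracy": exact / len(labels)}
```

A function shaped like multilabel_metrics could be adapted into the compute_metrics callback that Trainer accepts. In practice, scikit-learn's f1_score, roc_auc_score, and accuracy_score would likely replace the hand-written counting.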