NLP - Cross-lingual offensive language identification

authors: Gojko Hajdukovic, Simon Dimc, 05.2021

Table of contents:

Setup
Usage

Setup

These instructions assume that you are in the repository root.

cd <repo_root>

To set up a virtual environment, run:

python -m venv venv
# Activate the environment
source venv/bin/activate

To install all project dependencies, run:

pip3 install -r requirements.txt
python -m spacy download en_core_web_sm
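
To confirm that the installation succeeded, you can check from inside the activated environment that the key packages can be located. The sketch below uses only the standard library; the helper name is_installed is illustrative and not part of the project.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # spaCy language models are installed as regular Python packages,
    # so the downloaded model can be checked the same way as spaCy itself.
    for name in ("spacy", "en_core_web_sm"):
        status = "ok" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

If either entry is reported as MISSING, make sure the virtual environment is activated and repeat the install commands.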

Get datasets: The datasets belong in the data/source_data folder. Download them from the following sources and put them into the listed folders:
- data/source_data/eng/binary/dataset_1
  source: https://github.com/sjtuprog/fox-news-comments
  Parse the JSON file fox-news-comments.json and convert it into a data.csv file with the format Label:Text.
- data/source_data/eng/binary/dataset_2
  source: https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech
  Use the Reddit data.
  Rename the file to data.csv.
- data/source_data/eng/binary/dataset_3
  source: https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech
  Use the Gab data.
  Rename the file to data.csv.
- data/source_data/eng/binary/dataset_4
  source: https://github.com/Vicomtech/hate-speech-dataset
  Copy the following folders and files:
  folders: all_files, sampled_test
  files: annotations_metadata.csv
- data/source_data/eng/multiclass/dataset_5
  source: https://github.com/mayelsherif/hate_speech_icwsm18
  Either download the tweets using the provided tweet IDs or contact the authors. Put the tweets in CSV files with the format tweet_id,tweet inside a downloaded_tweets_dataset folder. The CSV file names should match the names of the provided tweet ID files.
- data/source_data/eng/multiclass/dataset_6
  source: https://github.com/Mrezvan94/Harassment-Corpus
  Contact the authors to obtain the data. Put the CSV files inside a tweets_dataset folder.
- data/source_data/slo/multiclass/dataset_2
  source: https://www.clarin.si/repository/xmlui/handle/11356/1398
  You will have to either download the tweets using the provided Tweet IDs or contact the authors. You will have to parse the data into a data.csv file with format: Text,Class,Type.
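
For dataset_1, the conversion step could look like the sketch below. It assumes that fox-news-comments.json stores one JSON object per line and that each object has label and text fields; verify both against the downloaded file. Reading "Label:Text" as a column order, and writing a Label,Text header with a comma delimiter, are also assumptions, so adjust the delimiter and header to whatever the project's data loader expects.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def iter_label_text(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs from JSON-lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field names; check them against the real file.
        yield str(record["label"]), record["text"]


def convert(src: Path, dst: Path, delimiter: str = ",") -> int:
    """Write the pairs to dst and return the number of data rows written."""
    count = 0
    with src.open(encoding="utf-8") as fin, \
            dst.open("w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout, delimiter=delimiter)
        writer.writerow(["Label", "Text"])  # assumed header
        for label, text in iter_label_text(fin):
            writer.writerow([label, text])
            count += 1
    return count


if __name__ == "__main__":
    folder = Path("data/source_data/eng/binary/dataset_1")
    src = folder / "fox-news-comments.json"
    if src.exists():
        print(convert(src, folder / "data.csv"), "rows written")
```

Using csv.writer rather than manual string joining keeps comments that contain the delimiter or line breaks correctly quoted.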
Get models: Download the following models and put the files into the listed folders:
- CroSloEngual BERT pre-trained model
  source: https://www.clarin.si/repository/xmlui/handle/11356/1330
  Put config.json, pytorch_model.bin, and vocab.txt inside classifiers/bert/CroSloEngual.
- Fine-tuned CroSloEngual BERT models
  source: https://drive.google.com/drive/folders/1j2BJ-X0WdNpxDFJHrmm-DsYuy1GeFb03?usp=sharing
  Put bert/binary.pt and bert/multiclass.pt inside models/bert.
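
Before running the classifiers, it can help to verify that the model files are where the instructions above say they should be. The sketch below checks the listed paths relative to the repository root; the helper name missing_files is illustrative and not part of the project.

```python
from pathlib import Path
from typing import Iterable, List

# Paths taken from the setup instructions, relative to the repository root.
EXPECTED_FILES = [
    "classifiers/bert/CroSloEngual/config.json",
    "classifiers/bert/CroSloEngual/pytorch_model.bin",
    "classifiers/bert/CroSloEngual/vocab.txt",
    "models/bert/binary.pt",
    "models/bert/multiclass.pt",
]


def missing_files(root: Path, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected relative paths that do not exist as files under root."""
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(Path("."))
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All model files found.")
```

The same list can be extended with the data.csv files of the datasets you plan to use.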

Usage

The project implements multiple classifiers for two classification tasks: binary and multiclass. A CLI application is provided to reproduce the results from the report. The following instructions assume that you are in the project root.

To show the CLI help, run:

python main.py --help

Examples:

python main.py --prepareData true --type multi --model LR
python main.py -pd false -t bi -m BERT
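
To run several configurations in a row, the invocations above can be scripted. The sketch below builds command lines with the long-form flags from the first example. It assumes that bi and multi are the accepted task types and that LR and BERT are accepted model names, as the examples suggest; check --help for the full list. The helper names build_command and run are illustrative.

```python
import subprocess
import sys
from typing import List


def build_command(prepare_data: bool, task_type: str, model: str) -> List[str]:
    """Build a main.py invocation using the flags shown in the examples."""
    return [
        sys.executable, "main.py",
        "--prepareData", "true" if prepare_data else "false",
        "--type", task_type,
        "--model", model,
    ]


def run(prepare_data: bool, task_type: str, model: str) -> int:
    """Run one configuration from the project root and return its exit code."""
    return subprocess.run(build_command(prepare_data, task_type, model)).returncode


if __name__ == "__main__":
    # The two configurations from the examples; printed here, not executed.
    for prepare, task, model in [(True, "multi", "LR"), (False, "bi", "BERT")]:
        print(" ".join(build_command(prepare, task, model)))
```

Replace the print call with run(prepare, task, model) to execute the configurations instead of listing them.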

For BERT fine-tuning, you can use the notebooks/bert-notebook.ipynb notebook in Google Colab.
