Abusive-Language-Detection-Categorization: A Python repository from HongyuGong

This is the TensorFlow implementation of the paper ``Supervised Neural Attention Improves Abusive Language Detection and Categorization''.  


Requirements:

1. Python3
2. Numpy
3. Sklearn
4. TensorFlow
5. [CMUTweetTagger](https://github.com/ianozsvald/ark-tweet-nlp-python). Put CMUTweetTagger.py and CMUTweetTokenizer.py in src/data_util/.
6. Preprocessor (install using ``pip install tweet-preprocessor'')


Running instructions:

1. Data preparation

(1) Create data folder data/ and dump folder dump/ in the same level of the folder src/.
(2) Download [datasets](https://drive.google.com/file/d/149nQCmB6EAflW_22wltZnjmf_EPzg2N7/view?usp=sharing). Put Anonymized_Sentences_Classified.csv and Anonymized_Comments_Categorized.csv in data/ in the same level of src/.
(3）Download pre-trained word2vec embedding.
(4) Tag with CMUTweetTagger

2. Preprocess datasets

```
python3 data_util/preproc_classification_data.py
'''

```
python data_util/preproc_categorization_data.py
'''

Build word and part-of-speech tag vocabulary, and word embeddings

```
python3 data_util/build_vocab_emb.py --wiki_embed_fn [EMBED_FN]
'''

[EMBED_FN]: the path of downloaded word2vec embedding.

3. Train abusive language classifier

```
CUDA_VISIBLE_DEVICES=0 python3 abuse_classification/train.py
'''

Test abusive language classifier

```
CUDA_VISIBLE_DEVICES=0 python3 abuse_classification/test.py
'''

4. Train abusive language categorizer

```
CUDA_VISIBLE_DEVICES=0 python3 abuse_categorization/train.py
'''

Test abusive language categorizer

```
CUDA_VISIBLE_DEVICES=0 python3 abuse_categorization/test.py
'''
HongyuGong/Abusive-Language-Detection-Categorization