Introduction_NLP

Stance Classification in Tweets

Pre-Processing

Step 1: Execute `python3 preprocessor.py --in_path --out_path --remove_numbers --remove_special_characters --remove_stopwords --stem`

`--in_path` is the path to the data to be preprocessed. The default file is `data/semeval2016-task6-trainingdata.txt`
`--out_path` is the location where the output data is written. The default location is `output/`
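
As a rough illustration of how these options could be declared, here is a minimal `argparse` sketch. It is not the actual contents of `preprocessor.py`: the `build_parser` helper is hypothetical, and treating the four preprocessing options as on/off switches is an assumption. Only the option names and the two default paths come from this README.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in this README."""
    parser = argparse.ArgumentParser(description="Pre-process tweets for stance classification")
    # Defaults taken from the README text.
    parser.add_argument("--in_path", default="data/semeval2016-task6-trainingdata.txt",
                        help="data to be preprocessed")
    parser.add_argument("--out_path", default="output/",
                        help="location of the output data")
    # Assumption: each preprocessing option is a boolean switch that is off by default.
    for flag in ("remove_numbers", "remove_special_characters", "remove_stopwords", "stem"):
        parser.add_argument(f"--{flag}", action="store_true")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--remove_numbers", "--stem"])
print(args)
```

Under these assumptions, any option left off the command line falls back to its default, so running the script with no arguments would read the SemEval training file and write to `output/`.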

Step 2: Pass in hyperparameters for further tuning:

`--remove_numbers` removes all digits from 0 to 9
`--remove_special_characters` removes special characters from the dataset
`--remove_stopwords` removes English stopwords
`--stem` applies stemming to the dataset
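
The effect of the four options can be sketched with the standard library alone. This is an illustrative approximation, not the code in `preprocessor.py`: the `preprocess` and `naive_stem` functions are hypothetical, the stopword list is a tiny stand-in for a full English list, the definition of "special character" (anything that is not a letter, digit, or whitespace) is an assumption, and the suffix stripper only imitates what a real stemmer such as a Porter stemmer would do.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a full English list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, remove_numbers=False, remove_special_characters=False,
               remove_stopwords=False, stem=False):
    """Apply the selected cleaning steps to one tweet and return the cleaned text."""
    if remove_numbers:
        text = re.sub(r"[0-9]", "", text)
    if remove_special_characters:
        # Assumption: "special" means anything that is not a letter, digit, or whitespace.
        text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return " ".join(tokens)


cleaned = preprocess("The 2 candidates are debating #ClimateChange!",
                     remove_numbers=True, remove_special_characters=True,
                     remove_stopwords=True, stem=True)
print(cleaned)
```

Each option is independent, so any combination can be switched on to compare its effect on the downstream classifier.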

Feature Engineering

The Term Frequency-Inverse Document Frequency (`TF-IDF`) matrix is calculated using the following procedure:

`fit_transform()` is applied to the data

The output of `get_feature_names()` is used as the `index` of the `DataFrame`

`todense()` is applied to convert the sparse TF-IDF matrix into a dense one for the `DataFrame`

`transpose()` swaps rows and columns so that the Bag of Words (BOW) terms appear as `columns` instead of `rows`

RESTful-API

To read the API documentation and the format of the POST requests, run `restapi.py` and open `/docs` and/or `/redoc` under the server's URL in a browser

LaverdeS/Introduction_NLP
