Twitter Sentiment Analysis (Text classification)

We had a complete dataset of 2,500,000 tweets: one half labeled positive and the other half labeled negative. Our task was to build a classifier to predict the labels of a test dataset of 10,000 tweets. This README describes the implementation of the classifier and presents the procedure to reproduce our results. The details of our implementation are given in the report.

Dependencies

All the scripts in this project run on Python 3.5.2, the default version on our GCP instance. For the neural network framework we used Keras, a high-level neural networks API, with TensorFlow as the backend.

The NVIDIA CUDA version is 8.0 and the cuDNN version is v6.0. Although newer versions of CUDA and cuDNN are available, we use the stable versions recommended on the official TensorFlow website.
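
As an optional check once TensorFlow is installed, you can verify that the GPU is visible (this is not part of our scripts; it only lists the devices TensorFlow can see):

    $ python3
    >>> from tensorflow.python.client import device_lib
    >>> [d.name for d in device_lib.list_local_devices()]   # should include a GPU entry for the GPU build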

  • [Scikit-Learn] (0.19.1) - Install the scikit-learn library with pip

    $ sudo pip3 install scikit-learn
  • [Gensim] (3.2.0) - Install the Gensim library

    $ sudo pip3 install gensim
  • [FastText] (0.8.3) - Install the FastText implementation

    $ sudo pip3 install fasttext
  • [NLTK] (3.2.5) - Install NLTK and download all packages

    // Install
    $ sudo pip3 install nltk
    
    // Download packages
    $ python3
    >>> import nltk
    >>> nltk.download()
  • [TensorFlow] (1.4.0) - Install TensorFlow. Depending on your platform, choose either the version without GPU support or the version with GPU support:

    // Without GPU version
    $ sudo pip3 install tensorflow
    
    // With GPU version
    $ sudo pip3 install tensorflow-gpu
  • [Keras] (1.4.0) - Install Keras

    $ sudo pip3 install keras
  • [XGBoost] (0.6a2) - Install XGBoost

    $ sudo pip3 install xgboost
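
As a quick sanity check after installation (optional, not part of the project scripts), you can confirm the installed versions and that Keras uses the TensorFlow backend:

    $ python3
    >>> import sklearn, gensim, nltk, tensorflow, keras, xgboost
    >>> print(sklearn.__version__, gensim.__version__, nltk.__version__, tensorflow.__version__, keras.__version__, xgboost.__version__)
    >>> import keras.backend as K; print(K.backend())   # should print "tensorflow"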

Folders / Files

  • segmenter.py:
    Helper function for the preprocessing step.

  • data_loading.py:
    Helper function for loading the original dataset and outputting pandas DataFrame objects as pickles.

  • data_preprocessing.py:
    Preprocessing module. Takes the output of data_loading.py and outputs preprocessed tweets.

  • cnn_training.py:
    Module of the three CNN models. Takes the output of data_preprocessing.py and generates results used as the input of xgboost_training.py.

  • xgboost_training.py:
    Module of the XGBoost model. Takes the output of cnn_training.py and generates the prediction result.

  • run.py:
    Script for running the modules data_loading.py, data_preprocessing.py, cnn_training.py, and xgboost_training.py.

  • data:
    This folder contains the necessary metadata and intermediate files used while running our scripts.

    • tweets: Contains the original train and test datasets downloaded from Kaggle.
    • dictionary: Contains the text files used for text preprocessing.
    • pickles: Contains the intermediate files of preprocessed text used as the input of the CNN models.
    • xgboost: Contains the intermediate output files of the CNN models, which are the input of the XGBoost model.
    • output: Contains the Kaggle-format output file from run.py.

    Note: The files inside tweets and dictionary are essential for running the scripts from scratch. Download tweets and dictionary, then unzip the downloaded file and move the extracted tweets and dictionary folders into the data/ directory.

    If you want to skip the preprocessing and CNN training steps, download the preprocessed data and pretrained model, then unzip the downloaded file and move all the extracted folders into the data/ directory.

  • othermodels:

    The files in this folder are the models we explored before arriving at our best model.

    • keras_nn_model.py: This is the classifier using an NN model; the word representation method is GloVe. Each tweet is represented by the average of its word vectors and fed into the NN model.

    • fastText_model.py: This is the classifier using FastText. The word representation is the FastText English pre-trained model.

    • svm_model.py: This is the classifier using a support vector machine. The word representation is TF-IDF, computed with Scikit-Learn's built-in method. A minimal sketch of this approach is shown below.
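
The sketch below illustrates the TF-IDF + SVM baseline. It is only an outline: the pickle path comes from the pipeline described above, but the column names ('tweet', 'label'), the linear SVM choice, and the vectorizer parameters are illustrative, not the exact contents of svm_model.py.

    # Illustrative sketch of the TF-IDF + linear SVM baseline.
    # Column names and vectorizer parameters are placeholders; see svm_model.py for the real code.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    train = pd.read_pickle('data/pickles/train_clean.pkl')      # preprocessed tweets and labels
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)  # word unigrams and bigrams
    X = vectorizer.fit_transform(train['tweet'])
    y = train['label']

    clf = LinearSVC()
    print(cross_val_score(clf, X, y, cv=3).mean())               # rough cross-validated accuracy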

Reproduce Our Best Score

Here are our steps, in order, from the original dataset to the submission file. Each step is modularized into its own .py file, so the steps can be executed individually. For your convenience, we provide run.py, which runs the modules with a simple command; a sketch of how its -m option maps onto these steps follows the list.

  1. Transform the dataset into a pandas DataFrame - data_loading.py
  2. Preprocess the dataset - data_preprocessing.py
  3. Train the CNN models - cnn_training.py
  4. Train the XGBoost model and generate the submission file - xgboost_training.py
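
The sketch below shows roughly how run.py's -m argument maps onto these modules. It is illustrative only: the actual entry-point function names inside each module may differ.

    # Illustrative dispatch logic for run.py; the main() entry points are placeholders.
    import argparse
    import data_loading, data_preprocessing, cnn_training, xgboost_training

    parser = argparse.ArgumentParser()
    parser.add_argument('-m', choices=['all', 'cnn', 'xgboost'], required=True)
    args = parser.parse_args()

    if args.m == 'all':                   # start from the raw Kaggle tweets
        data_loading.main()
        data_preprocessing.main()
    if args.m in ('all', 'cnn'):          # train the CNN models
        cnn_training.main()
    xgboost_training.main()               # always runs: produces the submission file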

First, make sure all the essential data is placed in the "data/" directory.

Second, there are three options to generate the submission file. We recommend the first option, which takes less than 10 minutes to reproduce the result with pretrained models.

- If you want to skip the preprocessing and CNN model training steps, execute run.py with the -m argument "xgboost":

    $ python3 run.py -m xgboost

  Note: Make sure that test_model1.txt, test_model2.txt, test_model3.txt, train_model1.txt, train_model2.txt, and train_model3.txt are in "data/xgboost" in order to launch run.py successfully.

- If you want to skip the preprocessing step and start from the CNN model training step, execute run.py with the -m argument "cnn":

    $ python3 run.py -m cnn

Note: Make sure that train_clean.pkl and test_clean.pkl are in "data/pickles" in order to launch run.py successfully.

- If you want to run all the steps from scratch, execute run.py with the -m argument "all":

    $ python3 run.py -m all

Note: Our preprocessing step requires a large amount of CPU resources. It is a multiprocessing step and will occupy all CPU cores.
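
For reference, the parallel preprocessing follows this general pattern (a sketch only; clean_tweet stands in for the real cleaning routine in data_preprocessing.py):

    # Illustrative multiprocessing pattern; clean_tweet is a placeholder for the
    # actual preprocessing function in data_preprocessing.py.
    from multiprocessing import Pool, cpu_count

    def clean_tweet(text):
        return text.lower().strip()       # stands in for the real cleaning steps

    def preprocess_all(tweets):
        with Pool(cpu_count()) as pool:   # occupies all available CPU cores
            return pool.map(clean_tweet, tweets)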

Finally, you can find prediction.csv in the "data/output" directory.