
Tweet Sentiment Analysis

The competition task was to predict whether a tweet message originally contained a positive :) or a negative :( smiley, by considering only the remaining text. Our team conducted comprehensive research on solutions proposed in the relevant literature, as well as past projects and articles tackling similar text sentiment analysis problems. The full specification of our experiments, along with the results and conclusions drawn, can be found in our report.

The complete project specification is available on the course's GitHub page.

Dependencies

The following dependencies are required in order to run the project:

Libraries

  • Anaconda3 - Download and install Anaconda with Python3

  • Scikit-Learn - Download scikit-learn library with conda

    conda install scikit-learn
  • Gensim - Install Gensim library

    conda install gensim
  • NLTK - Download all the packages of NLTK

    python
    >>> import nltk
    >>> nltk.download()
  • Tensorflow - Install the tensorflow library (version 1.4.1 was used)

    $ pip install tensorflow==1.4.1
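
After installing everything, a quick sanity check can confirm the libraries are importable. This is a minimal sketch (not part of the project's scripts); note that the names below are Python import names, which differ from some package names (e.g. scikit-learn imports as sklearn):

```python
import importlib.util

def missing_dependencies(modules):
    """Return the subset of module names that cannot be imported."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

# Import names for the libraries listed above.
required = ["sklearn", "gensim", "nltk", "tensorflow"]
print(missing_dependencies(required))  # empty list if everything is installed
```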

Files

  • Train tweets

    Download twitter-datasets.zip containing the positive and negative tweet files required during the model training phase. After unzipping, place the extracted files in the ./data/datasets directory.

  • Test tweets

    Download test_data.txt containing the tweets required for testing the trained model and obtaining a score for submission to Kaggle. This file needs to be placed in the ./data/datasets directory.

  • Stanford Pretrained Glove Word Embeddings

    Download the GloVe pretrained word embeddings, which are used for training the advanced sentiment analysis models. After unzipping, place the file glove.twitter.27B.200d.txt in the ./data/glove directory.
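
The GloVe text file stores one token per line, followed by its vector components separated by spaces. A minimal loader might look like the sketch below (an illustration only; the project itself may load the file differently, e.g. via gensim):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # First field is the token, the rest are the vector components.
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```

For glove.twitter.27B.200d.txt each vector has 200 components, and loading the whole file into memory is one reason for the RAM requirement listed below.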

Hardware requirements

  • at least 16 GB of RAM
  • a graphics card (optional for faster training involving CNNs)

Tested on Ubuntu 16.04 with an Nvidia Tesla K80 GPU (12 GB GDDR5).

Kaggle competition

Public Leaderboard connected to this competition.

Our team's name is Bill Trader.

Team members:

Reproducing our best result

You can find already preprocessed tweet files test_full.csv, train_neg_full.csv.zip and train_pos_full.csv.zip in the ./data/parsed directory.

To run the preprocessing again, you must have the Train tweets and Test tweets files in the ./data/datasets directory. Then go to the /src folder and run run_preprocessing.py with the argument train or test to generate the files required for running the CNN.

$ python run_preprocessing.py train
$ python run_preprocessing.py test

To reproduce our best score from Kaggle, go to the /src folder and run run_cnn.py with the argument eval:

$ python run_cnn.py eval

A checkpoint for reproducing our best score is stored in the data/models/1513824111 directory, so the training part will be skipped. If you want to run the training process from scratch, just pass the argument train when running run_cnn.py.

To run the evaluation you must have the necessary files in place: glove.twitter.27B.200d.txt in the ./data/glove directory, and the preprocessed tweet files test_full.csv, train_neg_full.csv and train_pos_full.csv in the ./data/parsed directory.
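
Before launching the evaluation, it can help to verify that everything is where the scripts expect it. This is a small sketch (not part of the repository); the paths simply mirror the directory layout described above:

```python
from pathlib import Path

def missing_files(base="."):
    """Return the required data files that are not present under base."""
    required = [
        "data/glove/glove.twitter.27B.200d.txt",
        "data/parsed/test_full.csv",
        "data/parsed/train_neg_full.csv",
        "data/parsed/train_pos_full.csv",
    ]
    root = Path(base)
    return [p for p in required if not (root / p).exists()]
```

An empty result means the evaluation can be started; otherwise the list names exactly what still has to be downloaded or generated.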


This project is available under the MIT license.