The goal of this project is to build a sentiment classifier that predicts whether a tweet originally contained a positive smiley :) or a negative smiley :( based on the remaining text.
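Concretely, the training labels come from the smileys that were stripped out of the tweets. A minimal sketch of that labeling idea (the helper name and example tweet are illustrative, not taken from the repository):

```python
def label_tweet(tweet):
    """Return (cleaned_text, label): 1 if the tweet contained :), 0 if :(."""
    if ":)" in tweet:
        return tweet.replace(":)", "").strip(), 1
    if ":(" in tweet:
        return tweet.replace(":(", "").strip(), 0
    return tweet.strip(), None  # no smiley: no label available

print(label_tweet("great day at the beach :)"))  # ('great day at the beach', 1)
```

In the provided datasets this step has already been done: train_pos*.txt and train_neg*.txt contain the cleaned texts for the two classes.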
The cluster is especially useful for training the neural networks. Log in via ssh:
ssh <user>@login.leonhard.ethz.ch
Copy the script utils/install_on_leonhard.source to your home directory (scp ./utils/install_on_leonhard.source <user>@login.leonhard.ethz.ch:install_on_leonhard.source) and source it. Then clone this repository:
git clone https://github.com/phil9987/cil_2018_text_sentiment.git
Navigate into the data/ sub-directory and download the datasets via:
curl http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip -o twitter-datasets.zip
unzip twitter-datasets.zip
mv twitter-datasets/* .
Then download the pre-trained word embeddings from the GloVe project:
curl https://nlp.stanford.edu/data/glove.twitter.27B.zip -o glove.twitter.27B.zip
unzip glove.twitter.27B.zip -d glove.twitter.27B
This should leave the data directory in the following state:
data/test_data.txt
data/train_neg_full.txt
data/train_neg.txt
data/train_pos_full.txt
data/train_pos.txt
data/glove.twitter.27B/glove.twitter.27B.25d.txt
data/glove.twitter.27B/glove.twitter.27B.50d.txt
data/glove.twitter.27B/glove.twitter.27B.100d.txt
data/glove.twitter.27B/glove.twitter.27B.200d.txt
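Each GloVe file is plain text: one token per line, followed by its space-separated vector components. A minimal loader sketch, run here on a fabricated in-memory example rather than the real file:

```python
import io

def load_glove(fileobj):
    """Map each token to its embedding vector (a list of floats)."""
    embeddings = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Two fabricated 3-dimensional entries in the GloVe text format:
sample = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
vectors = load_glove(sample)
print(vectors["happy"])  # [0.1, 0.2, 0.3]
```

For the real data you would pass e.g. open("data/glove.twitter.27B/glove.twitter.27B.25d.txt", encoding="utf-8") instead of the StringIO object.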
From inside the utils/ sub-directory run the pre-processing script:
cd utils
source preprocess_data.source
From inside the utils/ sub-directory run the activation script:
cd utils
source activate_on_leonhard.source
The code for the random forest classification is contained in the baseline/ sub-directory. It can be run via python baseline.py. On the cluster, a job should be started as follows:
bsub -B -N -n 4 -R "rusage[mem=16000,ngpus_excl_p=1]" python3 baseline.py
It is worth having a look at the scikit-learn documentation; there are many parameters that can be explored: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
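As a rough illustration of the scikit-learn API in question: the sketch below fits a RandomForestClassifier on random stand-in features (averaging the GloVe vectors of a tweet's words into one fixed-size vector is a common baseline feature choice; treating that as what baseline.py does is an assumption, not a copy of its code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in features: one 25-dim averaged-embedding vector per tweet.
X = rng.normal(size=(200, 25))
y = rng.integers(0, 2, size=200)

# n_estimators and max_depth are two of the many tunable parameters.
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]).shape)  # (3,)
```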
The code for the recurrent neural net baseline is contained in the baseline_simple_nn/ sub-directory. It can be run as follows:
bsub -B -N -n 4 -R "rusage[mem=16000,ngpus_excl_p=1]" python3 simple_rnn.py
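For intuition about what the recurrent baseline computes, here is a minimal forward pass of a vanilla RNN cell in NumPy (the dimensions, random weights, and sigmoid readout are illustrative only; simple_rnn.py builds its network in TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, hidden_dim = 6, 25, 16

x = rng.normal(size=(seq_len, emb_dim))        # embedded tokens of one tweet
W_xh = rng.normal(size=(emb_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
w_out = rng.normal(size=hidden_dim) * 0.1

h = np.zeros(hidden_dim)
for t in range(seq_len):                       # recurrence over the sequence
    h = np.tanh(x[t] @ W_xh + h @ W_hh)

p = 1.0 / (1.0 + np.exp(-(h @ w_out)))         # probability of the :) class
print(float(p))
```

The hidden state h carries information from earlier tokens forward, which is what lets the model use word order rather than just a bag of words.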
The code for our own model is contained in the our_model/ sub-directory. Our approach combines a recurrent neural network with additions and tweaks that improve its performance. We utilize TensorFlow's Estimator interface. On the cluster, it can be run as follows:
bsub -B -N -n 4 -R "rusage[mem=16000,ngpus_excl_p=1]" python3 main_v8.py
Our final, best-performing model is contained in our_model/main_v8.py, but code for a couple of variations on this model is still available under our_model/archive/.