
treehopper

treehopper is a Tree-LSTM-based dependency tree sentiment labeler, implemented in PyTorch and optimized for morphologically rich languages with relatively loose word order (such as Polish).

treehopper was originally developed as a submission for PolEval 2017, a SemEval-inspired NLP evaluation contest for Polish. It scores 0.80 accuracy on the PolEval Task 2 evaluation dataset. For more details, see the paper accompanying this submission: Fine-tuning Tree-LSTM for phrase-level sentiment classification on a Polish dependency treebank.

What the heck are Tree-LSTMs and dependency tree sentiment labeling?

A dependency tree is a linguistic formalism for describing the structure of sentences. Dependency trees are parse trees just like constituency trees, but slightly more useful when dealing with languages that have complex inflection and relatively loose word order, such as Czech, Turkish, or Polish.

Tree sentiment labeling is the task of labeling each phrase (subtree) of a parse tree with its sentiment. The Stanford Sentiment Treebank is one famous dataset for this task, although it uses constituency trees as its underlying linguistic formalism.

Tree-LSTMs (Tai et al., 2015) generalize LSTMs from chain-like to tree-like structures, enabling state-of-the-art tree sentiment labeling. treehopper implements the variant known as the Child-Sum Tree-LSTM, in which each node of a tree can have an unbounded number of children and no order is imposed over those children. This makes the approach particularly well suited for dependency trees.
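
For intuition, below is a minimal PyTorch sketch of a single Child-Sum Tree-LSTM node update, following the equations of Tai et al. (2015). It is illustrative only: the class and argument names (ChildSumTreeLSTMCell, input_size, hidden_size, child_h, child_c) are made up for this sketch and need not match treehopper's internals, which additionally handle embeddings, dropout, and the per-node sentiment classifier.

import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """Illustrative sketch of one Child-Sum Tree-LSTM node update."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Input, output and candidate gates see the *sum* of children states.
        self.iou = nn.Linear(input_size + hidden_size, 3 * hidden_size)
        # A separate forget gate is computed for every child.
        self.f_x = nn.Linear(input_size, hidden_size)
        self.f_h = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, child_h, child_c):
        # x: (input_size,) embedding of the current token
        # child_h, child_c: (num_children, hidden_size); shape (0, hidden_size) at leaves
        h_tilde = child_h.sum(dim=0)                         # sum of children hidden states
        i, o, u = self.iou(torch.cat([x, h_tilde])).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f_x(x) + self.f_h(child_h))   # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                 # new cell state
        h = o * torch.tanh(c)                                # new hidden state
        return h, c

At labeling time, such a cell is applied bottom-up over the dependency tree: leaves receive empty child tensors, and each node's hidden state h is fed both to its parent and to a classifier that predicts the sentiment of the subtree rooted at that node.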

How to use

First things first:

git clone git@github.com:tomekkorbak/treehopper.git

Dependencies

Make sure to use Python>=3.5, PyTorch>=0.2 and a Unix-like operating system (sorry, Windows users).

We recommend managing your dependencies using virtualenv and pip. For instructions on installing an appropriate PyTorch version please refer to its website. All other dependencies can be installed by running pip install -r requirements.txt.

Inference using a pre-trained model

We provide a pre-trained model, trained on the full PolEval training dataset (excluding the evaluation dataset) with the default hyperparameters (i.e. those described in the paper).

The script assumes the data are already tokenized and parsed. Specifically, input_sentences must be a list of tokenized sentences separated by newline characters, and input_parents must be a list of dependency trees in PolEval format (i.e. each token is assigned the index of its parent).
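
As a rough illustration of this format (the sentence, its parse, and the read_tree helper below are our own example, not part of treehopper's API): for the tokenized sentence Bardzo dobry film . a plausible parents line is 2 3 0 3, meaning token 1 depends on token 2, token 2 on token 3, token 3 is the root (head index 0), and token 4 depends on token 3. A few lines of Python turn such a line into a children table:

# Hypothetical helper, not part of treehopper: parse one line of the
# parents file into a mapping from each head to its children (0 = root).
def read_tree(parents_line):
    children = {}
    for child, head in enumerate(map(int, parents_line.split()), start=1):
        children.setdefault(head, []).append(child)
    return children

print(read_tree("2 3 0 3"))   # {2: [1], 3: [2, 4], 0: [3]}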

cd treehopper/
curl -o model.pth <<URL WILL BE ADDED HERE>>
python predict.py --model_path model.pth \
               --input_parents test/polevaltest_parents.txt \
               --input_sentences test/polevaltest_sentence.txt \
               --output output.txt

Evaluating a pre-trained model

./fetch_data.sh
cd treehopper/
python evaluate.py --model_path model.pth

By default, evaluation is run against the PolEval evaluation dataset.

Training from scratch

./fetch_data.sh
cd treehopper/
python train.py

By default, trained models are saved after each epoch in /models/saved_models/.

Documentation

For complete API documentation, please run predict.py, train.py, or evaluate.py with the --help flag.

All flags default to hyperparameters described in the paper.

Authors

Tomasz Korbak (tomasz.korbak@gmail.com)
Paulina Żak (paulina.zak1@gmail.com)

How to cite

@article{korbakzak2017,
  author    = {Tomasz Korbak and
               Paulina {\.Z}ak},
  title     = {Fine-tuning Tree-LSTM for phrase-level sentiment classification on
               a Polish dependency treebank. Submission to PolEval task 2},
  journal   = {Proceedings of the 8th Language \& Technology Conference (LTC 2017)},
  year      = {2017},
  url       = {http://arxiv.org/abs/1711.01985}
}

Acknowledgements

treehopper's core code was loosely based on TreeLSTMSentiment, which was in turn based on the original Lua implementation of Tree-LSTM by Tai et al. (2015).