Ubik Sentiment

The project creates a simple machine learning model to classify positive or negative emotion of a given sentence.

The training data is in finnish and generated from Korppi with a simple heuristic. Korppi corpus API is searched for sentences containing happy smile ':)' and sad smileys ':(' which are assumed to note for positive and negative sentiments respectively.

To evaluate the model's feasibility for real world use a simple annotation tool is can be used to annotate data for evaluation.

A cli tool is provided the test the sentiment of a sentence:

$ python test_sentence.py

Anna lause: Auto ei ollutkaan niin hyväkuntoinen kun myyjä antoi ymmärtää
['neg']
Anna lause: Pahemmastakin suosta on noustu
['pos']

Installation

Create some training and test data if you don't have it already.

To get some finnish language raw smiley data from Korppi:

$ python fetch_rawdata.py > korp_all_sentences.txt

Then process it to create training and development sets and data to annotate into evaluation set:

$ generate_sets.sh

Then process it to create training and development sets and data to annotate into evaluation set: If you want to create human annotated evaluation set, use annotate.py:

$ python annotate.py
To annotate lines, press p for positive, n for negative, any other letter to discard. Q to quit.

siis koitas perheen kans vaikkapa ihan perinteistä munien maalaamista
pos
leivo jotain tai saahan sitä kaupoista valmistakin
quit

Annotated 1 positives and 0 negative sentences. Discarded 0 rows.
In total 1 sentences have been processed.

Next create the model.

$ python train_classifier.py

Warning: enforcing limit of 10000 samples to dataset ./data/korp_train_pos.txt
Warning: enforcing limit of 10000 samples to dataset ./data/korp_devel_pos.txt
Found 14768 training sentences, 12330 development testing sentences.
Creating a vectorizer...
Vectorizing training set...
Vectorizing development set...
Accuracy: 0.664 (devel set), 0.958 (training set).
Saving vectorizer and classifier...
Done.

Finally try the model with a sentence.

$ python test_sentence.py

Work definitions for positive and negative emotions

For purposes of manual annotation of evaluation data the following definition was used:

Negative sentiment contains

negative emotions such as anger or sadness
uncertainty, confusion
anti-social behaviour such as judging, calling names or insulting others
contains cursing or vulgarity

Positive sentiment contains

positive emotions such as excitement or happiness
encouraging, hopeful or helpful behaviour

Author

tjkemp

Distributed under the MIT license. See license.txt for more information.

tjkemp/ubik-sentiment

Ubik Sentiment

Installation

Work definitions for positive and negative emotions

Author