Scent of literature

Sentiment analysis of Russian literature, trained on a very small dataset

It uses

Pandas to read the input data
scikit-learn (sklearn) for the classification work

Usage

Just run this in a terminal:

./eval.py

Variables

is_test_run is a boolean that selects between a test run, which only prints a performance report by evaluating the model on the training dataset, and a real prediction on test_file
train_file is the path to the training dataset, which should contain texts and labels (currently read from columns 1 and 2, to match the structure of the default train.tsv file)
test_file is the path to the file you want predictions for; it should contain a single column with the text to analyze (by default column 0 is read, to match the structure of data.txt)
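
As a rough illustration of the column layout described above, here is a minimal sketch of reading both files with pandas. The helper names load_train and load_test, the in-memory sample data, and the assumption that the files are tab-separated and have no header row are all hypothetical; eval.py may read its inputs differently.

```python
import io

import pandas as pd

# Hypothetical stand-ins for train.tsv and data.txt; the real files may differ.
TRAIN_TSV = (
    "1\tWhat a wonderful novel\tpositive\n"
    "2\tA dull and tedious story\tnegative\n"
)
TEST_TXT = "An unforgettable ending\nI could not finish it\n"


def load_train(source, text_col=1, label_col=2):
    """Read a headerless TSV and return (texts, labels) as plain lists."""
    frame = pd.read_csv(source, sep="\t", header=None)
    return frame[text_col].tolist(), frame[label_col].tolist()


def load_test(source, text_col=0):
    """Read a headerless file and return the texts from one column."""
    frame = pd.read_csv(source, sep="\t", header=None)
    return frame[text_col].tolist()


texts, labels = load_train(io.StringIO(TRAIN_TSV))
unlabeled = load_test(io.StringIO(TEST_TXT))
print(texts, labels, unlabeled)
```

Passing a file path instead of io.StringIO works the same way, since pd.read_csv accepts both.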

Under the hood

To build the vocabulary and the feature vectors it uses TfidfVectorizer, which weights the words and bigrams we give it by tf-idf (term frequency multiplied by inverse document frequency), more on tf-idf here
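
To make the vectorization step concrete, here is a small sketch of TfidfVectorizer over words and bigrams. The toy documents are invented for illustration, and ngram_range=(1, 2) is an assumption based on the mention of words and bigrams; the settings actually used in eval.py may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for this example.
docs = ["good book", "bad book", "good good ending"]

# ngram_range=(1, 2) extracts both single words and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word or bigram to its column index in the matrix.
vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(matrix.shape)

# Rarer terms get a larger inverse document frequency:
# "bad" occurs in one document, "book" in two.
idf_bad = vectorizer.idf_[vocab["bad"]]
idf_book = vectorizer.idf_[vocab["book"]]
print(idf_bad > idf_book)
```

Each row of the resulting sparse matrix is one document, and each column is one word or bigram from the vocabulary.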

To perform the classification it uses SGDClassifier (also here is wiki) with hinge as the loss function, which amounts to a linear SVM trained with stochastic gradient descent. As far as I know, linear SVMs show the best results in sentiment analysis, and SGDClassifier has more tuning options than LinearSVC
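
The classification step could look roughly like the sketch below, which chains the vectorizer and an SGDClassifier with hinge loss in one pipeline. The training sentences, the labels, and random_state=0 are assumptions made for this example, not values taken from eval.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; the real data comes from train_file.
train_texts = [
    "a wonderful and moving novel",
    "beautiful prose and a great story",
    "a dull and tedious book",
    "boring plot and weak characters",
]
train_labels = ["positive", "positive", "negative", "negative"]

# loss="hinge" makes SGDClassifier fit a linear SVM.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["a great and moving story", "tedious and boring"])
print(list(predictions))
```

Wrapping both steps in a pipeline means the same vocabulary is applied at prediction time as was learned during training.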

Note: the model's hyperparameters are chosen with sklearn's GridSearchCV (more on this here) and are tuned for the best F1 score
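
For reference, a hyperparameter search of this kind can be sketched as follows. The parameter grid, the toy data, cv=2, and the macro-averaged F1 scorer ("f1_macro") are all assumptions made for illustration; the grid and the exact F1 variant used for this project are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up data, just large enough for 2-fold cross-validation.
texts = [
    "a wonderful and moving novel", "beautiful prose and a great story",
    "a great and wonderful ending", "moving and beautiful characters",
    "a dull and tedious book", "boring plot and weak characters",
    "a weak and boring ending", "tedious prose and a dull story",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])

# Hypothetical grid: step name, double underscore, parameter name.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3],
}

# Every combination is cross-validated and ranked by macro-averaged F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds the pipeline refitted on all the data with the winning parameters.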

Contribution