Russian literature sentiment analysis on a very small dataset
- Pandas to read the input data
- Sklearn for the classification work
Just run this in a terminal:
./eval.py
- is_test_run is a boolean that defines whether the script should only report its performance by testing itself on the training dataset, or perform a real prediction on test_file
- train_file is the path to the training dataset, which should contain text and labels (right now the columns are 1 and 2, because of the structure of the default train.tsv file)
- test_file is the path to the file you want to run prediction on; it should contain a single column with the text you want to analyze (by default it reads column 0, because of the structure of data.txt)
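As a rough illustration of the column layout described above, here is a minimal sketch of reading both files with pandas. The in-memory strings stand in for train.tsv and data.txt and are made up; the tab separator, the missing header row, and the choice of column 1 as text and column 2 as label are assumptions, not something confirmed by eval.py.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.tsv: column 0 is an id,
# column 1 is the text, column 2 is the label (assumed order).
train_tsv = "0\tЯ люблю эту книгу\tpositive\n1\tСкучный роман\tnegative\n"

# Hypothetical stand-in for data.txt: one text per line, so the text is column 0.
test_txt = "Прекрасная повесть\nУжасный финал\n"

# header=None makes pandas name the columns 0, 1, 2, ...
train = pd.read_csv(io.StringIO(train_tsv), sep="\t", header=None)
texts, labels = train[1], train[2]

test = pd.read_csv(io.StringIO(test_txt), sep="\t", header=None)
test_texts = test[0]

print(list(labels))
print(list(test_texts))
```

With real files, the `io.StringIO(...)` arguments would simply be replaced by the paths held in train_file and test_file.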
To build the vocabulary and the feature vectors it uses sklearn's TfidfVectorizer, which weights the words and bigrams we give it by term frequency times inverse document frequency (tf-idf); more on tf-idf here
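To make the tf-idf step more concrete, here is a small sketch of vectorizing words and bigrams with TfidfVectorizer. The three sample documents are made up, and `ngram_range=(1, 2)` is the only non-default setting; the actual settings used in eval.py may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample documents, for illustration only.
docs = ["хорошая книга", "плохая книга", "очень хорошая книга"]

# ngram_range=(1, 2) extracts both single words and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x features

vocab = vectorizer.vocabulary_  # maps each word or bigram to a column index
idf = vectorizer.idf_           # inverse document frequency per column

print(X.shape)
print(sorted(vocab))
```

A term that occurs in every document, such as "книга" here, gets the lowest idf weight, while rarer terms such as "плохая" are weighted higher.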
To perform the classification it uses sklearn's SGDClassifier (also here is wiki) with hinge as the loss function, which makes it a linear SVM. As far as I know, linear SVMs show some of the best results in sentiment analysis, and SGDClassifier has more tuning options than LinearSVC
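A minimal sketch of how the vectorizer and the classifier might be chained is shown below. The training texts, the labels, the pipeline step names `tfidf` and `clf`, and `random_state=0` are all illustrative assumptions; only the hinge loss comes from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only.
train_texts = [
    "прекрасная трогательная книга",
    "очень хороший роман",
    "скучная и плохая повесть",
    "ужасный затянутый роман",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # loss="hinge" makes SGDClassifier train a linear SVM.
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])
model.fit(train_texts, train_labels)

predictions = model.predict(["хорошая книга", "плохая повесть"])
print(list(predictions))
```

Wrapping both steps in a Pipeline keeps the vocabulary learned during training in sync with the one used at prediction time.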
Note: the model's hyperparameters are chosen by sklearn's GridSearchCV (more on this here) and are tuned to maximize the F1 score
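The tuning step could look roughly like the sketch below. The parameter grid, the 2-fold split, the toy data, and the integer labels (1 for positive, 0 for negative, so that `scoring="f1"` works for the binary case) are assumptions for illustration, not the grid actually used for this model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up data: 1 = positive, 0 = negative.
texts = [
    "прекрасная книга", "очень хороший роман",
    "хорошая добрая повесть", "прекрасный хороший рассказ",
    "плохая книга", "очень скучный роман",
    "плохая скучная повесть", "ужасный плохой рассказ",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])

# Hypothetical grid; keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3],
}

# scoring="f1" selects the combination with the best cross-validated F1 score.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data with the winning parameters, so it can be used directly for prediction.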
If you can improve the prediction performance and prove it with a better F1 score, you are always welcome to send me a PR