/scent-of-literature

Sentiment analysis of Russian literature using a very small dataset

Primary language: Python. License: MIT.

Scent of literature

It uses

  • pandas to read the input data
  • scikit-learn (sklearn) for the classification work

Usage

Run this in a terminal:

./eval.py

Variables

  • is_test_run is a boolean. When it is true, the script only prints a performance report by evaluating the model on the training dataset; when it is false, the script performs a real prediction on test_file
  • train_file is the path to the training dataset, which should contain the text and the labels (currently read from columns 1 and 2, matching the structure of the default train.tsv file)
  • test_file is the path to the file you want predictions for; it should contain a single column with the text to analyze (by default column 0 is read, matching the structure of data.txt)
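
As a rough illustration of the column layout described above, here is a minimal sketch of how the two files could be read with pandas. The helper names (load_train, load_test), the sample rows, the label values, and the assumption that column 1 holds the text and column 2 the label are all hypothetical; eval.py may do this differently.

```python
import io

import pandas as pd

# Hypothetical stand-ins for train.tsv and data.txt; the real files may differ.
TRAIN_TSV = "1\tХорошая книга\tpositive\n2\tСкучный роман\tnegative\n"
TEST_TXT = "Прекрасная повесть\nУжасный рассказ\n"


def load_train(source, text_col=1, label_col=2):
    """Read a headerless TSV and return (texts, labels) from the given columns."""
    frame = pd.read_csv(source, sep="\t", header=None)
    return frame[text_col].tolist(), frame[label_col].tolist()


def load_test(source, text_col=0):
    """Read a headerless file with one text per line and return the texts."""
    frame = pd.read_csv(source, sep="\t", header=None)
    return frame[text_col].tolist()


# io.StringIO is used here only so the sketch runs without files on disk;
# a real path such as "train.tsv" can be passed instead.
train_texts, train_labels = load_train(io.StringIO(TRAIN_TSV))
test_texts = load_test(io.StringIO(TEST_TXT))
```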

Under the hood

To build the vocabulary and the feature vectors it uses sklearn's TfidfVectorizer, which weights the words and bigrams we give it by term frequency multiplied by inverse document frequency, more on tf-idf here
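
A minimal sketch of this step, assuming the vectorizer is configured with ngram_range=(1, 2) to cover words and bigrams; the sample sentences are made up and the actual settings in eval.py may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample documents, for illustration only.
docs = ["хорошая книга", "плохая книга", "очень хорошая повесть"]

# ngram_range=(1, 2) extracts both single words and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word or bigram to its column index in the matrix.
vocab = vectorizer.vocabulary_

# A term that appears in fewer documents gets a higher idf weight.
idf_rare = vectorizer.idf_[vocab["плохая"]]   # appears in 1 document
idf_common = vectorizer.idf_[vocab["книга"]]  # appears in 2 documents
```

Here "плохая" ends up with a larger idf weight than "книга" because it occurs in fewer documents.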

To perform the classification it uses SGDClassifier (also here is wiki) with hinge as the loss function, which makes it a linear SVM trained by stochastic gradient descent. As far as I know, SVMs show the best results in sentiment analysis, and SGDClassifier has more tuning options than LinearSVC
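
The two steps can be chained in an sklearn Pipeline. This is only a sketch under assumptions: the step names ("tfidf", "sgd"), the toy texts, the label values, and random_state are hypothetical and not taken from eval.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy training data, for illustration only.
texts = ["хорошая книга", "прекрасная повесть", "плохая книга", "ужасный роман"]
labels = ["pos", "pos", "neg", "neg"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # loss="hinge" gives a linear SVM trained with stochastic gradient descent.
    ("sgd", SGDClassifier(loss="hinge", random_state=0)),
])
model.fit(texts, labels)

predictions = model.predict(["хорошая повесть", "плохая книга"])
```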

Note: the model's hyperparameters are chosen with sklearn's GridSearchCV (more on this here), which is set to select the combination with the best F1 score
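
A sketch of such a search, assuming a pipeline like the one described above. The parameter grid, the "f1_macro" scoring variant, cv=2, and the toy data are assumptions chosen so the example runs quickly; the grid actually used for this project is not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, for illustration only: four texts per class.
texts = [
    "хорошая книга", "прекрасная повесть", "хорошая повесть", "прекрасная книга",
    "плохая книга", "ужасный роман", "плохая повесть", "ужасная книга",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(loss="hinge", random_state=0)),
])

# Hypothetical grid: parameters are addressed as <step name>__<parameter>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "sgd__alpha": [1e-4, 1e-3],
}

# scoring="f1_macro" averages the per-class F1 scores; which F1 variant
# the project uses is an assumption here.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

best_params = search.best_params_
best_score = search.best_score_
```

After fitting, search.best_estimator_ holds the pipeline refitted on all the data with the winning parameters.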

Contribution

If you can improve the prediction performance and demonstrate it with a better F1 score, you are always welcome to send me a PR