Corpus for Aspect-Based Sentiment Analysis in Russian-language sentences from different social sources.
This repository provides dataset and additional materials of the paper: "Neural-network Method for Determining Text Author’s Sentiment to an Aspect Specified by the Named Entity".
The dataset was prepared for the three sentiment label (positive, neutral, and negative) using a crowdsourcing platform and a home-grown procedure for collecting and controlling annotators markup.
Version 0.0.1 (paper version)
The data is in Russian.
Data was collected from several sources: posts of the Live Journal social network, texts of the online news agency Lenta.ru, and Twitter microblog posts.
An example for an instance from the dataset is shown below:
{
'context': {
'text': 'даа уж, один Кирюха только и понимает ;-)',
'sentences': [
[
{'forma': 'даа', 'lemma': 'да'},
{'forma': 'уж', 'lemma': 'уж'},
{'forma': ',', 'lemma': ','},
{'forma': 'один', 'lemma': 'один'},
{'forma': 'Кирюха', 'lemma': 'Кирюх'},
{'forma': 'только', 'lemma': 'только'},
{'forma': 'и', 'lemma': 'и'},
{'forma': 'понимает', 'lemma': 'понимать'},
{'forma': ';', 'lemma': ';'},
{'forma': '-', 'lemma': '-'},
{'forma': ')', 'lemma': ')'}
]
]
},
'aspect': {
'text': 'Кирюха',
'aspect_loc': [{'pos_start': 13, 'pos_end': 19}],
'ner': 'PER'
},
'label': 1,
'source': 'twitter'
}
Sentiment label codes: {0: 'neutral', 1: 'positive', 2: 'negative'}
In total, dataset contains 5,552 unique sentences (6,705 pairs aspect-sentiment) with 1,580 sentences from LiveJournal posts, 2,050 sentences from Lenta.Ru, and 1,992 sentences from Twitter.
The dataset includes a set of train/test splits, with 5,364, and 1,341 examples respectively.
The dataset are uploaded to our cloud storage and are available here.
The number of labels by data source is presented in Table 1 and the number of unique named entities is presented in Table 2.
Data source | Positive | Neutral | Negative |
---|---|---|---|
977 | 510 | 510 | |
Lenta.ru | 478 | 1653 | 472 |
LiveJournal | 834 | 905 | 366 |
Total | 2289 | 3068 | 1348 |
Data source | Person | Organisations | Total |
---|---|---|---|
1432 | 275 | 1818 | |
Lenta.ru | 1244 | 573 | 1817 |
LiveJournal | 1307 | 285 | 1592 |
Total | 3761 | 1068 | 4829 |
Annotating sentences with labels of their emotions was performed with the help of a crowdsourcing platform Yandex Toloka.
Only those of the 30% of the best-performing (by the platform’s internal rating) active Russian-speaking users over 18 years old were allowed into the annotation process. Before platform users became annotators, they underwent a training session in which they had to annotate 25 test cases (which we annotated ourselves), with more than 80% accuracy.
The annotators were asked to determine in what sentiment the author uses the named entity in the selected sentences (the classes of sentiment were “Positive”, “Neutral” and “Negative”). The sentence could not be marked with multiple tags. If the annotator could not unambiguously determine the sentiment class of the selected aspect, this example was marked with the label “I find it difficult to determine” and, in the absence of other annotations, was not included in the resulting dataset. If the selected aspect was incorrectly defined by a named entity, the example was marked as “Wrong aspect” and was also not included in the final dataset.
Sentences were split into tasks and assigned to annotators so that each sentence was annotated at least three times.
The final label for a sentence was chosen based on the aggregate of annotator labels by majority vote.
Each instance is a text sentence in Russian with one aspect sentiment label which includes the following fields:
Each instance is a text sentence in Russian with one specific named entity (aspect) and its corresponding sentiment label, which are included in the following fields:
- context:
- text: the text of the sentence;
- sentences: text tokenized and lemmatized with udpipe
- forma: the original word form;
- lemma: the lemma of this word
- aspect:
- ner: tag name of the named entity of the aspect;
- aspect_loc:
- pos_start: the position of the start symbol in the text of the sentence for the aspect
- pos_end: the position of the ending character in the text of the sentence for the given aspect
- label: the sentiment label for aspect;
- source: the tag name of the corresponding source
We compare several basic solutions:
- Random: Random selection of a label for each aspect;
- Lexicon: Classifier based on dictionaries of positive and negative vocabulary. The aspect is assigned sentiment label that belongs to the vocabulary with the largest number of words from which are present in the sentence. If the number of words included in the dictionaries of different sentiment is the same, then the label of the most representative sentiment of the corpus is given, i.e. "positive" in this case. If the sentence does not contain any words from the dictionaries then the aspect is considered neutral.
- TPOT (ELMo): The average vectors of aspect words of the analyzed sentence obtained from the ELMo model are used as input features for the classification model. The type of classifier and its parameters were selected automatically by the TPOT library;
Calculation of baselines is presented in the jupyter notebook ./notebooks/Baseline_accuracy.ipynb
Before running the notebook, you need to download the data and place them in a folder ./data/
.
The proposed method is based on deep neural network with attention (IAN), which solves a classification task.
In our model, the context is all the words of the sentence which contains a named entity, and the aspect is the words that belong to the same named entity for which sentiment is determined. The sentence words are vectorized using the bi-directional language model ELMo.
The figure below shows the scheme of the proposed model.
The performance of the baseline models on our dataset in terms of F1 scores is presented in Table 2.
Model | Twitter | LJ | Lenta .ru | All | ||||
---|---|---|---|---|---|---|---|---|
mic. | mac. | mic. | mac. | mic. | mac. | mic. | mac. | |
Random | 0.34 | 0.33 | 0.34 | 0.33 | 0.34 | 0.32 | 0.31 | 0.30 |
Lexicon | 0.48 | 0.36 | 0.42 | 0.34 | 0.35 | 0.34 | 0.41 | 0.35 |
TPOT (ELMo) | 0.63 | 0.57 | 0.59 | 0.57 | 0.74 | 0.71 | 0.66 | 0.65 |
Our approach | 0.64 | 0.58 | 0.67 | 0.66 | 0.78 | 0.72 | 0.72 | 0.70 |
In the published article, incorrect accuracies were reported in Table 2 for the baseline models "Random" and "Lexicon", but the overall result and conclusions were not affected, the correct accuracies are reported in Table 2 of this repository.
- Python 3.7+
- scikit-learn==0.22.1
- tpot==0.11.1
If you have found our results helpful in your work, feel free to cite our publication and this repository as
@article{naumov2020neural,
title={Neural-network Method for Determining Text Author’s Sentiment to an Aspect Specified by the Named Entity},
author={Naumov, Aleksandr and Sboev, Alexander and Rybka, Roman and Selivanov, Anton and Gryaznov, Artem},
year={2020}
}
Thanks to @naumov-al for adding this dataset.