Poor Vader Sentiment Accuracy. Lots of influential words missing from the vader_lexicon.txt
bdteo opened this issue · 5 comments
So, I tried running this implementation of the Vader algorithm on this dataset: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
All I do is: vader(normalize_tokens(tokenize('and . ' . $sample[0])))
(adding 'and . ' as a dummy first word as a workaround for a bug in the library)
Here are the results:
[
"vader" => array:3 [
"amazon_cells_labelled.txt" => array:9 [
"positive" => 500
"negative" => 500
"matched-positive" => 367
"failed-positive" => 133
"matched-negative" => 223
"failed-negative" => 277
"matched-neutral" => 320
"matched-%-positive" => 73.4
"matched-%-negative" => 44.6
]
"imdb_labelled.txt" => array:9 [
"positive" => 500
"negative" => 500
"matched-positive" => 364
"failed-positive" => 136
"matched-negative" => 233
"failed-negative" => 267
"matched-neutral" => 261
"matched-%-positive" => 72.8
"matched-%-negative" => 46.6
]
"yelp_labelled.txt" => array:9 [
"positive" => 500
"negative" => 500
"matched-positive" => 358
"failed-positive" => 142
"matched-negative" => 178
"failed-negative" => 322
"matched-neutral" => 350
"matched-%-positive" => 71.6
"matched-%-negative" => 35.6
]
]
]
I read how the algorithm works and I liked its simplicity.
However, the accuracy in the example above seems extremely poor, mainly because of the lean lexicon.
Are there fuller lexicons for the Vader algorithm? What else can I do to improve accuracy?
As you can see, the accuracy on negative sentences is beyond tragic.
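To put the per-class ratios into one number per file, the matched counts can be combined into an overall accuracy. This is a minimal sketch that uses only the counts reported above; the function name `overall_accuracy` is made up for illustration.

```php
<?php
// Counts copied from the results reported above ("matched" = classified as expected).
$reported = [
    'amazon_cells_labelled.txt' => ['positive' => 500, 'negative' => 500, 'matched-positive' => 367, 'matched-negative' => 223],
    'imdb_labelled.txt'         => ['positive' => 500, 'negative' => 500, 'matched-positive' => 364, 'matched-negative' => 233],
    'yelp_labelled.txt'         => ['positive' => 500, 'negative' => 500, 'matched-positive' => 358, 'matched-negative' => 178],
];

// Hypothetical helper: share of all samples whose predicted class matched the label, in percent.
function overall_accuracy(array $counts): float
{
    $total = $counts['positive'] + $counts['negative'];
    $matched = $counts['matched-positive'] + $counts['matched-negative'];

    return $total > 0 ? round($matched / $total * 100, 1) : 0.0;
}

foreach ($reported as $file => $counts) {
    printf("%s: %.1f%%\n", $file, overall_accuracy($counts));
}
// amazon_cells_labelled.txt: 59.0%
// imdb_labelled.txt: 59.7%
// yelp_labelled.txt: 53.6%
```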
The repo with the lexicon can be found here: https://github.com/cjhutto/vaderSentiment. I wonder if there is an issue with how you parse the data files.
Can you provide the code? I would like to understand why you are getting poor results and also be able to reproduce the error that is fixed by prepending "and" to the sample.
Alternatively, since you have labeled sentences, you can easily train a naive Bayes classifier.
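To illustrate the naive Bayes suggestion, here is a rough, self-contained sketch of a multinomial naive Bayes classifier with add-one smoothing. It assumes PHP 7.4 or later. The class `TinyNaiveBayes`, the helper `simple_tokens`, and the training sentences are all invented for the example; this is not the library's own classifier API, and a real run would train on the labelled sentences from the dataset instead.

```php
<?php
// Hypothetical minimal multinomial naive Bayes with add-one (Laplace) smoothing.
final class TinyNaiveBayes
{
    private array $wordCounts = []; // label => [token => count]
    private array $docCounts = [];  // label => number of training sentences
    private array $vocabulary = []; // token => true

    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($tokens as $token) {
            $this->wordCounts[$label][$token] = ($this->wordCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    public function predict(array $tokens): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $best = '';
        $bestScore = -INF;
        foreach ($this->docCounts as $label => $docs) {
            $labelWords = array_sum($this->wordCounts[$label] ?? []);
            $denominator = max(1, $labelWords + $vocabSize);
            // Log prior plus the sum of smoothed log likelihoods.
            $score = log($docs / $totalDocs);
            foreach ($tokens as $token) {
                $count = $this->wordCounts[$label][$token] ?? 0;
                $score += log(($count + 1) / $denominator);
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = $label;
            }
        }

        return $best;
    }
}

// Hypothetical tokenizer: lowercase, then split on anything that is not a letter.
function simple_tokens(string $text): array
{
    return preg_split('/\PL+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Invented toy sentences, only to show the train/predict flow.
$nb = new TinyNaiveBayes();
foreach (['great phone works great', 'loved the food', 'excellent service'] as $sentence) {
    $nb->train('positive', simple_tokens($sentence));
}
foreach (['terrible battery', 'the food was bad', 'worst service ever'] as $sentence) {
    $nb->train('negative', simple_tokens($sentence));
}

echo $nb->predict(simple_tokens('excellent phone')), "\n"; // positive
echo $nb->predict(simple_tokens('worst battery')), "\n";   // negative
```

With the real dataset, a fair comparison against Vader would train on one part of each file and evaluate on the remaining sentences.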
Here is the code I wrote to conduct the tests:
<?php
require_once __DIR__ . '/../../vendor/autoload.php';

// Read every tab-separated file in test-data into rows of [sentence, label].
$dataDir = __DIR__ . '/test-data';
$dataFiles = scandir($dataDir);
$dataSet = [];
foreach ($dataFiles as $dataFile) {
    $dataPath = $dataDir . '/' . $dataFile;
    // Dummy enclosure, intended to keep quotes inside sentences from being parsed as CSV enclosures.
    $dummyEnclosure = "\u{8999}";
    if (is_file($dataPath)) {
        $dataSet[$dataFile] = array_map(function ($e) use ($dummyEnclosure) {
            return str_getcsv($e, "\t", $dummyEnclosure);
        }, file($dataPath));
    }
}

$results = [];
$results['vader'] = [];

// Strip every non-letter character and lowercase the token.
$normalizeWord = function ($word) {
    return strtolower(preg_replace('/\PL/u', '', $word));
};

foreach ($dataSet as $set => $samples) {
    $results['vader'][$set] = [
        'positive' => 0,
        'negative' => 0,
        'matched-positive' => 0,
        'failed-positive' => 0,
        'matched-negative' => 0,
        'failed-negative' => 0,
        'matched-neutral' => 0,
    ];
    foreach ($samples as $sample) {
        // 'and . ' is the workaround prefix for the short-sentence bug.
        $result = vader(
            (normalize_tokens(tokenize('and . ' . $sample[0]), $normalizeWord))
        );

        // Labels in the data files: 1 = positive, 0 = negative.
        $expectedPositivity = (int)$sample[1];
        $expectedNegativity = (int)!$sample[1];

        // Map the Vader scores to a predicted class.
        $positivity = (int)($result['neg'] < 0.2 && ($result['compound'] > 0.1000));
        $negativity = (int)($result['neg'] > 0.1 || $result['compound'] < 0);
        $neutrality = (int)(!$positivity && !$negativity);

        $results['vader'][$set]['positive'] += $expectedPositivity;
        $results['vader'][$set]['negative'] += $expectedNegativity;

        if ($expectedPositivity) {
            $results['vader'][$set]['matched-positive'] += (int)($positivity === $expectedPositivity);
            $failedPositivity = $positivity !== $expectedPositivity;
            $results['vader'][$set]['failed-positive'] += (int)($failedPositivity);
            // if ($failedPositivity) {
            //     dump('Should be positive +: ' . $sample[0]);
            // }
        }

        if ($expectedNegativity) {
            $results['vader'][$set]['matched-negative'] += (int)($negativity === $expectedNegativity);
            $failedNegativity = $negativity !== $expectedNegativity;
            $results['vader'][$set]['failed-negative'] += (int)($failedNegativity);
            // if ($failedNegativity) {
            //     dump(['Should be negative: ' . $sample[0], $result, $negativity, $positivity]);
            // }
        }

        $results['vader'][$set]['matched-neutral'] += $neutrality;
    }
}

// Convert the matched counts into percentages per class.
foreach ($results as $tool => $result) {
    foreach ($result as $set => $source) {
        $results[$tool][$set]['matched-ratio-positive'] = ($results[$tool][$set]['matched-positive'] / ($results[$tool][$set]['positive'])) * 100;
        $results[$tool][$set]['matched-ratio-negative'] = ($results[$tool][$set]['matched-negative'] / ($results[$tool][$set]['negative'])) * 100;
    }
}

dump($results);
The code runs inside a Laravel project.
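For comparison, the vaderSentiment README suggests classifying on the compound score alone: positive at compound >= 0.05, negative at compound <= -0.05, and neutral in between. The script above uses different cut-offs and also looks at the neg score, so the two rules will not always agree. A sketch of the compound-only rule follows; `classify_compound` is a hypothetical helper, and the scores passed to it are made-up inputs, not output from the library.

```php
<?php
// Hypothetical helper: three-way classification on the compound score only,
// using the thresholds commonly quoted in the vaderSentiment README.
function classify_compound(float $compound, float $threshold = 0.05): string
{
    if ($compound >= $threshold) {
        return 'positive';
    }
    if ($compound <= -$threshold) {
        return 'negative';
    }

    return 'neutral';
}

// Made-up compound scores, only to show the three outcomes.
foreach ([0.6, -0.3, 0.0] as $compound) {
    printf("%5.2f => %s\n", $compound, classify_compound($compound));
}
```

Because the dataset only has positive and negative labels, every sentence that lands in the neutral band counts as a miss for whichever class it was labelled with.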
Prepending "and " solves this:
>>> vader(normalize_tokens(tokenize('ok great power'), function ($word) {return strtolower(preg_replace('/\PL/u', '', $word));}))
PHP Notice: Undefined offset: -1 in /var/mena-tokenrush/vendor/yooper/php-text-analysis/src/Sentiment/Vader.php on line 288
Laravel apps treat PHP warnings as exceptions, and our app treats PHP notices as exceptions as well. That is why I needed the workaround.
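The notice itself means that an array was read at index -1, presumably when the code looks at the token before the first one. A minimal sketch of a bounds-safe lookup that avoids this kind of notice is shown below; `token_at` is a hypothetical helper and not the code in Vader.php.

```php
<?php
// Hypothetical helper: return the token at $index, or null when the index
// is out of range. The ?? operator does not raise an undefined-offset notice.
function token_at(array $tokens, int $index): ?string
{
    return $tokens[$index] ?? null;
}

$tokens = ['ok', 'great', 'power'];

var_dump(token_at($tokens, 1));     // string(5) "great"
var_dump(token_at($tokens, 0 - 1)); // NULL, and no notice is raised
```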
Naive Bayes has its flaws, but I may combine two or three approaches.
Thank you very much for your prompt response!
Hello,
I fixed the bug that was triggered when the sentence was too short. You no longer need to add "and" to the list of tokens. The fix will be available in release 1.4.1. After the fix, I got slightly better results than you did. Feel free to improve the lexicon and contribute your changes back to the project.
array(1) {
'vader' =>
array(3) {
'amazon_cells_labelled.txt' =>
array(9) {
'positive' =>
int(500)
'negative' =>
int(500)
'matched-positive' =>
int(408)
'failed-positive' =>
int(92)
'matched-negative' =>
int(285)
'failed-negative' =>
int(215)
'matched-neutral' =>
int(223)
'matched-ratio-positive' =>
double(81.6)
'matched-ratio-negative' =>
double(57)
}
'imdb_labelled.txt' =>
array(9) {
'positive' =>
int(500)
'negative' =>
int(500)
'matched-positive' =>
int(375)
'failed-positive' =>
int(125)
'matched-negative' =>
int(337)
'failed-negative' =>
int(163)
'matched-neutral' =>
int(163)
'matched-ratio-positive' =>
double(75)
'matched-ratio-negative' =>
double(67.4)
}
'yelp_labelled.txt' =>
array(9) {
'positive' =>
int(500)
'negative' =>
int(500)
'matched-positive' =>
int(397)
'failed-positive' =>
int(103)
'matched-negative' =>
int(263)
'failed-negative' =>
int(237)
'matched-neutral' =>
int(235)
'matched-ratio-positive' =>
double(79.4)
'matched-ratio-negative' =>
double(52.6)
}
}
}
After looking more closely at the Vader class, I decided to make its properties more accessible. You can now modify the tokens and their weights. Please review the changes in commit ad5b139 to see how to modify the class properties directly and improve your results.
The changes will be available in the 1.4.2 release.
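One way to extend the lexicon is to keep domain-specific words in a separate list and merge them over the stock entries. The sketch below only assumes the tab-separated layout of vader_lexicon.txt (token, mean rating, standard deviation, raw ratings). The helpers `parse_lexicon_lines` and `merge_lexicon`, the sample lines, and the extra words and weights are all made up for illustration, and the code does not touch the Vader class properties from the commit.

```php
<?php
// Hypothetical helper: turn tab-separated lexicon lines into [token => mean rating].
function parse_lexicon_lines(array $lines): array
{
    $lexicon = [];
    foreach ($lines as $line) {
        $fields = explode("\t", rtrim($line, "\r\n"));
        if (count($fields) < 2 || $fields[0] === '') {
            continue; // skip blank or malformed lines
        }
        $lexicon[$fields[0]] = (float)$fields[1];
    }

    return $lexicon;
}

// Hypothetical helper: entries in $extra override or extend $base.
// array_replace keeps keys as they are, unlike array_merge, which renumbers integer keys.
function merge_lexicon(array $base, array $extra): array
{
    return array_replace($base, $extra);
}

// Illustrative lines in the lexicon layout; the values are not copied from the real file.
$sampleLines = [
    "great\t3.0\t0.5\t[3, 3, 3]\n",
    "bad\t-2.5\t0.6\t[-2, -3, -3]\n",
];

// Made-up domain words and weights.
$extra = ['overpriced' => -1.8, 'bad' => -3.0];

$merged = merge_lexicon(parse_lexicon_lines($sampleLines), $extra);
print_r($merged);
```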
I will close this issue tomorrow if no feedback is received.
Cheers,
Awesome! Thank you! A bigger lexicon would be the other big improvement, but I may suggest that in the Vader repo.