yooper/php-text-analysis

Poor Vader Sentiment Accuracy. Lots of influential words missing from the vader_lexicon.txt

bdteo opened this issue · 5 comments

bdteo commented

So, I tried running this implementation of the Vader algorithm on this dataset: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

All I do is call vader(normalize_tokens(tokenize('and . ' . $sample[0]))), prepending 'and . ' as a dummy first token to work around a bug in the library.

Here are the results:

[
"vader" => array:3 [
    "amazon_cells_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 367
      "failed-positive" => 133
      "matched-negative" => 223
      "failed-negative" => 277
      "matched-neutral" => 320
      "matched-%-positive" => 73.4
      "matched-%-negative" => 44.6
    ]
    "imdb_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 364
      "failed-positive" => 136
      "matched-negative" => 233
      "failed-negative" => 267
      "matched-neutral" => 261
      "matched-%-positive" => 72.8
      "matched-%-negative" => 46.6
    ]
    "yelp_labelled.txt" => array:9 [
      "positive" => 500
      "negative" => 500
      "matched-positive" => 358
      "failed-positive" => 142
      "matched-negative" => 178
      "failed-negative" => 322
      "matched-neutral" => 350
      "matched-%-positive" => 71.6
      "matched-%-negative" => 35.6
    ]
]
]

I read how the algorithm works and I liked its simplicity.

However, the accuracy in the example above seems extremely poor, mainly because of the lean lexicon.

Are there fuller lexicons for the Vader algorithm? What else can I do to improve accuracy?
As you can see, the accuracy when classifying negative sentences is beyond tragic.
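One way to check whether the lexicon really is the bottleneck is to count which words in the misclassified sentences have no lexicon entry. The sketch below assumes the lexicon has already been loaded into an array keyed by lowercase token; `countUnknownWords` and the toy data are hypothetical and not part of php-text-analysis.

```php
<?php
/**
 * Count words that occur in the sentences but have no lexicon entry.
 * Hypothetical helper for illustration only.
 *
 * @param string[]             $sentences
 * @param array<string, float> $lexicon   token => valence
 * @return array<string, int>             unknown word => frequency, most frequent first
 */
function countUnknownWords(array $sentences, array $lexicon): array
{
    $counts = [];
    foreach ($sentences as $sentence) {
        // Split on every run of characters that are neither letters nor apostrophes.
        $words = preg_split("/[^\\p{L}']+/u", strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!array_key_exists($word, $lexicon)) {
                $counts[$word] = ($counts[$word] ?? 0) + 1;
            }
        }
    }
    arsort($counts);

    return $counts;
}

// Toy data; real input would be the misclassified samples and the full lexicon.
$lexicon = ['good' => 1.9, 'bad' => -2.5];
$failed = ['The battery is bad and flimsy', 'Flimsy case, not good'];

$unknown = countUnknownWords($failed, $lexicon);
print_r(array_slice($unknown, 0, 5, true));
```

Function words such as "the" will dominate the raw counts, so filtering stop words first makes the missing sentiment-bearing words easier to spot.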

The repo with the lexicon can be found here: https://github.com/cjhutto/vaderSentiment. I wonder if there is an issue with how you parse the data files.

Can you provide code? I would like to understand why you are getting poor results, and also be able to reproduce the error that is fixed by prepending "and" to the sample.

Alternatively, since you have labeled sentences, you can easily train a Naive Bayes classifier.
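To make the Naive Bayes suggestion concrete, here is a minimal multinomial Naive Bayes with add-one smoothing. It is a self-contained sketch that deliberately avoids any library API; the class `TinyNaiveBayes`, its methods, and the training sentences are all made up for illustration.

```php
<?php
/**
 * Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
 * Illustrative only; the class and method names are hypothetical.
 */
final class TinyNaiveBayes
{
    /** @var array<string, array<string, int>> word counts per label */
    private array $wordCounts = [];
    /** @var array<string, int> number of training documents per label */
    private array $docCounts = [];
    /** @var array<string, bool> every token seen during training */
    private array $vocabulary = [];

    /** @param string[] $tokens */
    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($tokens as $token) {
            $this->wordCounts[$label][$token] = ($this->wordCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    /** @param string[] $tokens */
    public function predict(array $tokens): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $best = '';
        $bestScore = -INF;

        foreach ($this->docCounts as $label => $docs) {
            $labelWords = array_sum($this->wordCounts[$label] ?? []);
            // Log prior plus the sum of smoothed log likelihoods.
            $score = log($docs / $totalDocs);
            foreach ($tokens as $token) {
                $count = $this->wordCounts[$label][$token] ?? 0;
                $score += log(($count + 1) / ($labelWords + $vocabSize));
            }
            if ($score > $bestScore) {
                $best = (string) $label;
                $bestScore = $score;
            }
        }

        return $best;
    }
}

// Lowercase and split on non-letters, similar to the normalization in the test script.
$tokenize = fn (string $s): array => preg_split('/\PL+/u', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);

$nb = new TinyNaiveBayes();
$nb->train('positive', $tokenize('Great phone, works great'));
$nb->train('positive', $tokenize('Love the sound quality'));
$nb->train('negative', $tokenize('Terrible battery, waste of money'));
$nb->train('negative', $tokenize('Poor quality and bad service'));

$label = $nb->predict($tokenize('great sound'));
echo $label, PHP_EOL; // prints "positive"
```

With the labelled sentence files, one would train on part of each file and measure accuracy on the held-out remainder, so the classifier is never scored on sentences it has already seen.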

bdteo commented

Here is the code I wrote to conduct the tests:

<?php

require_once __DIR__ . '/../../vendor/autoload.php';

$dataDir = __DIR__ . '/test-data';
$dataFiles = scandir($dataDir);
$dataSet = [];

foreach ($dataFiles as $dataFile) {
    $dataPath = $dataDir . '/' . $dataFile;

    $dummyEnclosure = "\u{8999}";

    if (is_file($dataPath)) {
        $dataSet[$dataFile] = array_map(function ($e) use ($dummyEnclosure) {
            return str_getcsv($e, "\t", $dummyEnclosure);
        }, file($dataPath));
    }
}

$results = [];
$results['vader'] = [];

$normalizeWord = function ($word) {
    return strtolower(preg_replace('/\PL/u', '', $word));
};

foreach ($dataSet as $set => $samples) {
    $results['vader'][$set] = [
        'positive' => 0,
        'negative' => 0,
        'matched-positive' => 0,
        'failed-positive' => 0,
        'matched-negative' => 0,
        'failed-negative' => 0,
        'matched-neutral' => 0,
    ];

    foreach ($samples as $sample) {
        $result = vader(
            (normalize_tokens(tokenize('and . ' . $sample[0]), $normalizeWord))
        );
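        // Expected labels from the data file: the second column is 1 for positive, 0 for negative.
        $expectedPositivity = (int)$sample[1];
        $expectedNegativity = (int)!$sample[1];

        // Hand-picked cut-offs on the Vader scores; a sample matching neither counts as neutral.
        $positivity = (int)($result['neg'] < 0.2 && ($result['compound'] > 0.1000));
        $negativity = (int)($result['neg'] > 0.1 || $result['compound'] < 0);
        $neutrality = (int)(!$positivity && !$negativity);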

        $results['vader'][$set]['positive'] += $expectedPositivity;
        $results['vader'][$set]['negative'] += $expectedNegativity;

        if ($expectedPositivity) {
            $results['vader'][$set]['matched-positive'] += (int)($positivity === $expectedPositivity);

            $failedPositivity = $positivity !== $expectedPositivity;
            $results['vader'][$set]['failed-positive'] += (int)($failedPositivity);

//            if ($failedPositivity) {
//                dump('Should be positive +: ' . $sample[0]);
//            }
        }

        if ($expectedNegativity) {
            $results['vader'][$set]['matched-negative'] += (int)($negativity === $expectedNegativity);

            $failedNegativity = $negativity !== $expectedNegativity;
            $results['vader'][$set]['failed-negative'] += (int)($failedNegativity);

//            if ($failedNegativity) {
//                dump(['Should be negative: ' . $sample[0], $result, $negativity, $positivity]);
//            }
        }

        $results['vader'][$set]['matched-neutral'] += $neutrality;
    }
}

foreach ($results as $tool => $result) {
    foreach ($result as $set => $source) {
        $results[$tool][$set]['matched-ratio-positive'] = ($results[$tool][$set]['matched-positive'] / ($results[$tool][$set]['positive'])) * 100;
        $results[$tool][$set]['matched-ratio-negative'] = ($results[$tool][$set]['matched-negative'] / ($results[$tool][$set]['negative'])) * 100;
    }

}

dump($results);

The code runs inside a Laravel project.
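Note that the two conditions in the script are not mutually exclusive: a sentence with `neg` between 0.1 and 0.2 and `compound` above 0.1 is counted as both positive and negative. The vaderSentiment README instead suggests classifying on the compound score alone, with 0.05 and -0.05 as cut-offs. A minimal sketch, assuming the result array has a `compound` key as in the script above; `classifyCompound` is a hypothetical helper, not a library function.

```php
<?php
/**
 * Map a Vader result to exactly one label using the compound score.
 * Hypothetical helper; the 0.05 default follows the vaderSentiment README.
 *
 * @param array<string, float> $scores result array containing a 'compound' key
 */
function classifyCompound(array $scores, float $threshold = 0.05): string
{
    $compound = (float) ($scores['compound'] ?? 0.0);

    if ($compound >= $threshold) {
        return 'positive';
    }
    if ($compound <= -$threshold) {
        return 'negative';
    }

    return 'neutral';
}

// Made-up score arrays for illustration.
echo classifyCompound(['neg' => 0.15, 'compound' => 0.62]), PHP_EOL;  // prints "positive"
echo classifyCompound(['neg' => 0.40, 'compound' => -0.51]), PHP_EOL; // prints "negative"
echo classifyCompound(['neg' => 0.00, 'compound' => 0.0]), PHP_EOL;   // prints "neutral"
```

Because the dataset only has positive and negative labels, every neutral prediction still counts as a miss, so the neutral count is worth reporting next to the accuracy figures.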

Prepending "and " solves this:

>>> vader(normalize_tokens(tokenize('ok great power'), function ($word) {return strtolower(preg_replace('/\PL/u', '', $word));}))
PHP Notice: Undefined offset: -1 in /var/mena-tokenrush/vendor/yooper/php-text-analysis/src/Sentiment/Vader.php on line 288

Laravel apps treat PHP warnings as exceptions, and our app treats PHP notices the same way. That is why the notice aborts the run.
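The mechanism can be reproduced without Laravel. The sketch below installs an error handler that turns every PHP notice or warning into an `ErrorException` (Laravel's handler works along these lines), triggers the same kind of out-of-range read, and then shows a null-coalescing guard that avoids it. The token list and variable names are illustrative; this is not the code from `Vader.php`.

```php
<?php
// Turn every notice and warning into an exception.
set_error_handler(function (int $severity, string $message, string $file, int $line): bool {
    throw new ErrorException($message, 0, $severity, $file, $line);
});

$tokens = ['ok', 'great', 'power'];
$index = 0;

// Unguarded look-behind: reads index -1, which does not exist.
$caught = null;
try {
    $previous = $tokens[$index - 1];
} catch (ErrorException $e) {
    $caught = $e->getMessage();
}

// Guarded look-behind: a missing index yields null and raises nothing.
$previousSafe = $tokens[$index - 1] ?? null;

restore_error_handler();

echo "Unguarded read raised: {$caught}", PHP_EOL;
var_dump($previousSafe);
```

On PHP 7 the message is the "Undefined offset: -1" notice quoted above; PHP 8 reports the same situation as an "Undefined array key" warning. Either way the handler converts it into an exception.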

Naive Bayes has its flaws but I may combine 2 or 3 approaches together.

Thank you very much for your prompt response!

Hello,

I fixed the bug that was triggered when the sentence was too short. You no longer need to add "and" to the list of tokens. The fix will be available in release 1.4.1. After the bug was fixed, I got slightly better results than you did. Feel free to improve the lexicon and contribute your changes back to the project.

array(1) {
  'vader' =>
  array(3) {
    'amazon_cells_labelled.txt' =>
    array(9) {
      'positive' =>
      int(500)
      'negative' =>
      int(500)
      'matched-positive' =>
      int(408)
      'failed-positive' =>
      int(92)
      'matched-negative' =>
      int(285)
      'failed-negative' =>
      int(215)
      'matched-neutral' =>
      int(223)
      'matched-ratio-positive' =>
      double(81.6)
      'matched-ratio-negative' =>
      double(57)
    }
    'imdb_labelled.txt' =>
    array(9) {
      'positive' =>
      int(500)
      'negative' =>
      int(500)
      'matched-positive' =>
      int(375)
      'failed-positive' =>
      int(125)
      'matched-negative' =>
      int(337)
      'failed-negative' =>
      int(163)
      'matched-neutral' =>
      int(163)
      'matched-ratio-positive' =>
      double(75)
      'matched-ratio-negative' =>
      double(67.4)
    }
    'yelp_labelled.txt' =>
    array(9) {
      'positive' =>
      int(500)
      'negative' =>
      int(500)
      'matched-positive' =>
      int(397)
      'failed-positive' =>
      int(103)
      'matched-negative' =>
      int(263)
      'failed-negative' =>
      int(237)
      'matched-neutral' =>
      int(235)
      'matched-ratio-positive' =>
      double(79.4)
      'matched-ratio-negative' =>
      double(52.6)
    }
  }
}
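For a single figure per file, the two matched counts can be combined into an overall accuracy. The snippet below uses only counts taken from the dump above; `overallAccuracy` is a hypothetical helper name.

```php
<?php
/**
 * Overall accuracy in percent: correctly labelled samples over all samples.
 * Hypothetical helper; neutral predictions count as misses.
 *
 * @param array<string, int> $counts
 */
function overallAccuracy(array $counts): float
{
    $total = $counts['positive'] + $counts['negative'];
    if ($total === 0) {
        return 0.0;
    }
    $matched = $counts['matched-positive'] + $counts['matched-negative'];

    return round($matched / $total * 100, 1);
}

// Counts copied from the results after the bug fix.
$afterFix = [
    'amazon_cells_labelled.txt' => ['positive' => 500, 'negative' => 500, 'matched-positive' => 408, 'matched-negative' => 285],
    'imdb_labelled.txt'         => ['positive' => 500, 'negative' => 500, 'matched-positive' => 375, 'matched-negative' => 337],
    'yelp_labelled.txt'         => ['positive' => 500, 'negative' => 500, 'matched-positive' => 397, 'matched-negative' => 263],
];

foreach ($afterFix as $file => $counts) {
    printf("%s: %.1f%%\n", $file, overallAccuracy($counts));
}
// amazon_cells_labelled.txt: 69.3%
// imdb_labelled.txt: 71.2%
// yelp_labelled.txt: 66.0%
```

Applied to the counts from the first run in this thread, the same formula gives 59.0%, 59.7% and 53.6%.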

After looking more closely at the Vader class, I have decided to make its properties more accessible. You can now modify the tokens and their weights. Please review the changes in commit ad5b139 for ways to modify the class properties directly and improve your results.

The changes will be available in the 1.4.2 release.
I will close this issue tomorrow if no feedback is received.
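As a starting point for extending the lexicon, the sketch below parses lines in the tab-separated layout used by `vader_lexicon.txt` in the vaderSentiment repo (token, mean valence, then further columns that are ignored here) and merges custom entries on top. The function names, the sample lines, and the extra words with their scores are made up for illustration. How the merged array is handed to the Vader class is deliberately left out; the commit mentioned above is the place to look for that.

```php
<?php
/**
 * Parse lexicon text into token => mean valence.
 * Hypothetical helper; assumes "token<TAB>valence<TAB>..." per line.
 *
 * @return array<string, float>
 */
function parseLexicon(string $contents): array
{
    $lexicon = [];
    foreach (preg_split('/\R/', $contents, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $columns = explode("\t", $line);
        // Skip malformed lines instead of guessing a score.
        if (count($columns) < 2 || !is_numeric($columns[1])) {
            continue;
        }
        $lexicon[$columns[0]] = (float) $columns[1];
    }

    return $lexicon;
}

/**
 * Add or override entries; tokens are lowercased to match normalized input.
 *
 * @param array<string, float> $lexicon
 * @param array<string, float> $extra
 * @return array<string, float>
 */
function extendLexicon(array $lexicon, array $extra): array
{
    foreach ($extra as $token => $valence) {
        $lexicon[strtolower((string) $token)] = (float) $valence;
    }

    return $lexicon;
}

// Sample lines in the same layout; all numbers here are placeholders.
$sample = "good\t1.9\t0.9\t[2, 1, 1, 3, 2, 4, 2, 2, 1, 1]\n"
        . "bad\t-2.5\t0.7\t[-3, -2, -4, -3, -2, -2, -3, -2, -2, -2]\n";

$lexicon = extendLexicon(parseLexicon($sample), ['Flimsy' => -1.5, 'overpriced' => -1.8]);
print_r($lexicon);
```

Scores for new words should stay within the lexicon's own scale of -4 to 4, so that custom entries do not outweigh the existing ones.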

Cheers,

bdteo commented

Awesome! Thank you! A bigger lexicon would be the other big improvement, but I may suggest that in the Vader repo.