Sentiment analysis research scripts.

FEATURE BASED SENTIMENT ANALYSIS ON TWITTER
===========================================

This file describes the contents of the repository, folder by folder.

Every file contains usage directions as comments.
Files named with a 'RUN_' prefix are the executable files. Any required 
parameters are to be specified in the executable file itself.

All files are written in PHP except those in the folder 'label_testset'.
All executable files contain the required hashbang.


_ctf
----

This folder contains all the tweets collected by the system. It is an 
unprocessed set and contains repeated tweets.

_datasets
---------

This folder contains two files.
The 'd.emoWrd.2011-07-09' file is the dataset that we used to train our 
classifier. It contains about 1.5 million tweets labeled as 'positive' 
or 'negative'.
The 'd.twitterSentiment' file is a similar dataset which is available 
at twittersentiment.appspot.com.

_testsets
---------

This folder contains the test sets that we used to test our classifier.
The 'mejaj.testset' file is the final test set we use.
All other files are randomly generated test sets.

_wb.processed
-------------

This folder contains the wordbanks used by the classifier. The file 
names specify the feature selection methods used and their thresholds.

_wordbanks
----------

This folder contains the wordbank files generated by our trainer.
The files named 'emoWrd' are the wordbanks for our dataset, while those 
named 'twitterSentiment' are for the dataset available at 
twittersentiment.appspot.com.

collect_tweets
--------------

This folder contains files required to collect tweets.
The 'timed_collector' file needs to be edited and run.

feature_extractor
-----------------

This folder contains the feature extractor used by our classifier.

naive_bayes_classifier
----------------------

This folder contains the classifier file. The wordbank to be used has 
to be specified in the 'naive_bayes_classifier' file.
The 'find_accuracy' file calculates the accuracy.
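
For readers unfamiliar with the technique, the following is a minimal 
Python sketch of naive Bayes classification over a wordbank, followed by 
an accuracy calculation. It is an illustration only and is not the 
repository's PHP code: the wordbank layout (word mapped to per-class 
counts), the add-one smoothing, and the names `classify` and `accuracy` 
are all assumptions.

```python
import math

def classify(tokens, wordbank, class_totals, priors):
    """Return the most likely label for a tokenized tweet.

    Assumed (hypothetical) layout:
      wordbank     -- {word: {"positive": count, "negative": count}}
      class_totals -- {label: total number of word occurrences in that class}
      priors       -- {label: prior probability of that class}
    """
    vocab_size = len(wordbank)
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        # Work in log space to avoid underflow on long inputs.
        score = math.log(prior)
        for tok in tokens:
            count = wordbank.get(tok, {}).get(label, 0)
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((count + 1) / (class_totals[label] + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With a toy wordbank in which 'good' appears only in positive tweets, 
`classify(["good"], ...)` would return 'positive'; `accuracy` then 
compares a list of such predictions against the labels in a test set.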

prepare_dataset
---------------

This folder contains the files needed to prepare the dataset from raw 
tweets. The list of keywords to be replaced has to be specified in the 
'ListKeyword' file. The raw files should be placed in the 
'DirTweetFiles' directory.
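
As an illustration of what keyword replacement could look like, here is 
a small Python sketch. It is not the repository's PHP code, and this 
README does not state what the keywords are replaced with: the 
`KEYWORD` placeholder, the case-insensitive whole-word matching, and the 
function name `replace_keywords` are assumptions.

```python
import re

def replace_keywords(tweet, keywords, placeholder="KEYWORD"):
    """Replace each whole-word keyword in a tweet with a placeholder token.

    The placeholder value is a hypothetical choice for this sketch.
    """
    if not keywords:
        return tweet
    # Try longer keywords first so multi-word keywords win over their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # Lookarounds require that the match is not embedded in a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(placeholder, tweet)
```

Replacing query keywords in this way keeps a classifier from learning 
the search terms themselves as sentiment features.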

prepare_testset
---------------

This folder contains all files needed to prepare the test set.
The 'prepare_testset' file randomly generates a test set.
The 'count_char_freq' folder contains files to count the character 
frequency of the test sets and datasets.
The character frequencies are required to calculate cross entropy by 
the files contained in the folder 'find_cross_entropy'.
The 'label_testset' folder contains files to remove unwanted 
tweets and manually label the training set.
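
The relationship between character frequencies and cross entropy can be 
sketched as follows. This is an illustrative Python example and is not 
the repository's code: the function names, the use of base-2 logarithms, 
the direction of the comparison (test set against dataset), and the 
probability floor for unseen characters are assumptions.

```python
import math
from collections import Counter

def char_distribution(texts):
    """Relative frequency of every character across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}

def cross_entropy(p, q, floor=1e-9):
    """H(p, q) = -sum over c of p(c) * log2 q(c), in bits per character.

    Characters missing from q get a small floor probability (an assumed
    workaround) so that the logarithm stays defined.
    """
    return -sum(
        p_c * math.log2(max(q.get(ch, 0.0), floor)) for ch, p_c in p.items()
    )
```

Here `p` would be the character distribution of a test set and `q` that 
of the dataset. A lower value suggests that the two sets have similar 
character statistics.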

trainer
-------

This folder contains files to create the wordbank from the dataset.
The 'count_word_frequency' file creates the wordbank.
The 'feature_selector' folder contains files to apply feature selection 
to the wordbanks.
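
To make the two steps concrete, here is a minimal Python sketch of 
building a wordbank by counting word frequencies per label and then 
applying a simple threshold. It is an illustration, not the repository's 
PHP code: the whitespace tokenization, the wordbank layout, and the 
minimum-count filter are assumptions. The feature selection methods 
actually used are the ones named in the '_wb.processed' file names.

```python
from collections import defaultdict

def build_wordbank(labeled_tweets):
    """Count how often each word occurs in tweets of each label.

    labeled_tweets: iterable of (text, label) pairs, where label is
    'positive' or 'negative'.
    """
    wordbank = defaultdict(lambda: {"positive": 0, "negative": 0})
    for text, label in labeled_tweets:
        # Naive lowercase + whitespace tokenization (an assumption).
        for word in text.lower().split():
            wordbank[word][label] += 1
    return dict(wordbank)

def select_by_min_count(wordbank, threshold):
    """Keep only words whose total count reaches the threshold.

    A stand-in for a real feature selection method.
    """
    return {w: c for w, c in wordbank.items() if sum(c.values()) >= threshold}
```

For example, with `threshold=2`, a word seen only once in the whole 
dataset is dropped from the wordbank.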