Repository for the Natural Language Processing Course Project
- Anant Kandikuppa B120519CS
- Hemant Pugaliya B120787CS
- Pranay Dhariwal B120762CS
- Python (version 2.7)
- NumPy: `pip install -U numpy`
- scikit-learn: `pip install -U scikit-learn`
- data-collection/ : Contains our chosen dataset
- stop-word-removal/ : Contains the cleaned dataset with stopwords removed, along with the Python code used to produce it
- vocabulary/ : Contains the vocabulary file and the code required to obtain it
- evaluate.py : Code for performing 10-fold cross-validation
- model.p : Pickled model obtained by training our model on the entire dataset
- model.py : Contains the definition for the NaiveBayesClassifier Class used to represent our model
- train-model.py : Code to train and pickle a model
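The pickling round trip between `train-model.py` (which writes `model.p`) and code that consumes it can be sketched as follows. This is an illustrative sketch using an in-memory toy object; the actual scripts serialize a trained `NaiveBayesClassifier` instance to the `model.p` file.

```python
import pickle

# Toy stand-in for a trained model; the project pickles a
# NaiveBayesClassifier instance in the same way.
model = {'positive_lines': 3, 'negative_lines': 2}

# train-model.py would write:  pickle.dump(model, open('model.p', 'wb'))
blob = pickle.dumps(model)

# a consumer would read:       pickle.load(open('model.p', 'rb'))
restored = pickle.loads(blob)
assert restored == model
```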
The NaiveBayesClassifier class defined in model.py represents a unigram model.
It has the following attributes:
- `positive_count` : A dictionary storing the counts of each word in the positive class
- `negative_count` : A dictionary storing the counts of each word in the negative class
- `vocab_length` : The length of the vocabulary obtained from the entire corpus
- `pos_corpus_length` : The count of words belonging to the positive class in the corpus
- `neg_corpus_length` : The count of words belonging to the negative class in the corpus
- `positive_lines` : The number of sentences that belong to the positive class in the dataset
- `negative_lines` : The number of sentences that belong to the negative class in the dataset
And it provides the following methods:
- `__init__(self)` : Default constructor
- `train_from_file(self, data_file)` : Trains a model instance using the data read from `data_file`
- `train(self, data, labels)` : Trains a model using a list of sentences as `data` and their corresponding labels as `labels`
- `test(self, test_data)` : Returns the predicted labels corresponding to the list of sentences passed as `test_data`
- `pos_word_prob(self, word)` : Returns the conditional probability of a word being in the positive class
- `neg_word_prob(self, word)` : Returns the conditional probability of a word being in the negative class
- `classify(self, line)` : Returns the predicted class of a sentence passed as `line`
- `write_counts(self, line)` : Prints the counts of the attributes for debugging purposes
We obtained an accuracy of 91.5% after 10-fold cross-validation of the model on our training data.
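The 10-fold cross-validation performed by evaluate.py can be sketched as follows. The fold splitter and `cross_validate` helper here are hand-rolled illustrations (the actual script may instead use scikit-learn's `KFold`); `make_model` is assumed to be any factory returning an object with the `train`/`test` interface described above.

```python
def k_fold_splits(n, k=10):
    """Partition indices 0..n-1 into k contiguous (train, test) splits."""
    base, extra = divmod(n, k)
    splits, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, test))
        start += size
    return splits

def cross_validate(make_model, data, labels, k=10):
    """Average held-out accuracy over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        model = make_model()
        model.train([data[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        predictions = model.test([data[i] for i in test_idx])
        correct = sum(1 for p, i in zip(predictions, test_idx)
                      if p == labels[i])
        accuracies.append(float(correct) / len(test_idx))
    return sum(accuracies) / len(accuracies)
```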