Repository for the Natural Language Processing Course Project
- Anant Kandikuppa B120519CS
- Hemant Pugaliya B120787CS
- Pranay Dhariwal B120762CS
- Python (version 2.7)
- NumPy: `pip install -U numpy`
- scikit-learn: `pip install -U scikit-learn`
- data-collection/ : Contains our chosen dataset
- stop-word-removal/ : Contains the cleaned dataset (stopwords removed) and the Python code used to produce it
- vocabulary/ : Contains the vocabulary file and the code used to generate it
- evaluate.py : Code for performing 10-fold cross-validation
- model.p : Pickled model obtained by training our unigram model on the entire dataset
- model2.p : Pickled model obtained by training our bigram model on the entire dataset (for Part B)
- model.py : Contains the definition of the NaiveBayesClassifier class used to represent our unigram model
- model2.py : Contains the definition of the NaiveBayesClassifier class used to represent our bigram model
- train-model.py : Code to train and pickle a model (a sketch follows this list)
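As a rough illustration of what train-model.py does, here is a minimal sketch that trains a classifier and pickles it. The training-file path is hypothetical, and the `train_from_file` method is described in the next section.

```python
# Minimal sketch of training and pickling a model.
# The data file path below is hypothetical; adjust it to the actual dataset location.
import pickle

from model import NaiveBayesClassifier

classifier = NaiveBayesClassifier()
classifier.train_from_file('stop-word-removal/cleaned-data.txt')  # hypothetical path

# Persist the trained model so it can be reloaded without retraining.
with open('model.p', 'wb') as f:
    pickle.dump(classifier, f)
```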
The NaiveBayesClassifier class defined in model.py and model2.py is used to represent the unigram model and the bigram model, respectively.
It provides the following methods:
- `__init__(self)` : Default constructor
- `train_from_file(self, data_file)` : Trains a model instance using the data read from data_file
- `train(self, data, labels)` : Trains a model using a list of sentences passed as data and their corresponding labels passed as labels
- `test(self, test_data)` : Returns the predicted labels corresponding to the list of sentences passed as test_data
- `pos_word_prob(self, word)` : Returns the conditional probability of a word given the positive class
- `neg_word_prob(self, word)` : Returns the conditional probability of a word given the negative class
- `classify(self, line)` : Returns the predicted class of the sentence passed as line
- `write_counts(self, line)` : Prints the attribute counts for debugging purposes
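A minimal usage sketch, assuming the pickled model in model.p exposes the methods listed above; the example sentence is illustrative only.

```python
import pickle

from model import NaiveBayesClassifier  # the class must be importable for unpickling

# Load the unigram model trained on the entire dataset.
with open('model.p', 'rb') as f:
    classifier = pickle.load(f)

sentence = 'the movie was surprisingly good'  # illustrative input
print(classifier.classify(sentence))          # predicted class label
print(classifier.pos_word_prob('good'))       # conditional probability under the positive class
```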
For Part A, we obtained an accuracy of 91.9% after 10-fold cross-validation of the unigram model on our training data. For Part B, we obtained an accuracy of 92.0% after 10-fold cross-validation of the bigram model on our training data.
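For reference, a sketch of how the 10-fold cross-validation in evaluate.py could be structured with scikit-learn's KFold (older scikit-learn releases expose it under sklearn.cross_validation with a slightly different interface). `load_dataset()` is a hypothetical helper standing in for the project's actual data-loading code.

```python
import numpy as np
from sklearn.model_selection import KFold

from model import NaiveBayesClassifier

sentences, labels = load_dataset()  # hypothetical helper returning parallel lists
sentences, labels = np.array(sentences), np.array(labels)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(sentences):
    clf = NaiveBayesClassifier()
    clf.train(list(sentences[train_idx]), list(labels[train_idx]))
    predicted = clf.test(list(sentences[test_idx]))
    accuracies.append(np.mean(np.array(predicted) == labels[test_idx]))

print('Mean 10-fold accuracy: %.1f%%' % (100 * np.mean(accuracies)))
```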