Repository for the Natural Language Processing Course Project
- Anant Kandikuppa B120519CS
- Hemant Pugaliya B120787CS
- Pranay Dhariwal B120762CS
- Python (version 2.7)
- NumPy: `pip install -U numpy`
- scikit-learn: `pip install -U scikit-learn`
- data-collection/ : Contains our chosen dataset
- stop-word-removal/ : Contains the cleaned dataset (stopwords removed) and the Python code used to produce it
- vocabulary/ : Contains the vocabulary file and the code used to generate it
- evaluate.py : Code for performing 10-fold cross-validation
- model.p : Pickled model obtained by training our unigram model on the entire dataset
- model2.p : Pickled model obtained by training our bigram model on the entire dataset (for Part B)
- model.py : Contains the definition of the NaiveBayesClassifier class used to represent our unigram model
- model2.py : Contains the definition of the NaiveBayesClassifier class used to represent our bigram model
- train-model.py : Code to train and pickle a model (a sketch follows this list)
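As a rough illustration of what train-model.py does, here is a minimal sketch that trains a classifier and pickles it. The training-file path is hypothetical, and the `train_from_file` method is described in the next section.

```python
# Minimal sketch of training and pickling a model.
# The data file path below is hypothetical; adjust it to the actual dataset location.
import pickle

from model import NaiveBayesClassifier

classifier = NaiveBayesClassifier()
classifier.train_from_file('stop-word-removal/cleaned-data.txt')  # hypothetical path

# Persist the trained model so it can be reloaded without retraining.
with open('model.p', 'wb') as f:
    pickle.dump(classifier, f)
```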
The NaiveBayesClassifier class defined in model.py and model2.py is used to represent the unigram model and the bigram model, respectively.
It provides the following methods:
- `__init__(self)` : Default constructor
- `train_from_file(self, data_file)` : Trains a model instance using the data read from data_file
- `train(self, data, labels)` : Trains a model using a list of sentences passed as data and their corresponding labels passed as labels
- `test(self, test_data)` : Returns the predicted labels corresponding to the list of sentences passed as test_data
- `pos_word_prob(self, word)` : Returns the conditional probability of a word given the positive class
- `neg_word_prob(self, word)` : Returns the conditional probability of a word given the negative class
- `classify(self, line)` : Returns the predicted class of the sentence passed as line
- `write_counts(self, line)` : Prints the attribute counts for debugging purposes
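A minimal usage sketch, assuming the pickled model in model.p exposes the methods listed above; the example sentence is illustrative only.

```python
import pickle

from model import NaiveBayesClassifier  # the class must be importable for unpickling

# Load the unigram model trained on the entire dataset.
with open('model.p', 'rb') as f:
    classifier = pickle.load(f)

sentence = 'the movie was surprisingly good'  # illustrative input
print(classifier.classify(sentence))          # predicted class label
print(classifier.pos_word_prob('good'))       # conditional probability under the positive class
```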
For Part A, we obtained an accuracy of 91.9% after 10-fold cross-validation of the unigram model on our training data. For Part B, we obtained an accuracy of 92.0% after 10-fold cross-validation of the bigram model on our training data.
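For reference, a sketch of how the 10-fold cross-validation in evaluate.py could be structured with scikit-learn's KFold (older scikit-learn releases expose it under sklearn.cross_validation with a slightly different interface). `load_dataset()` is a hypothetical helper standing in for the project's actual data-loading code.

```python
import numpy as np
from sklearn.model_selection import KFold

from model import NaiveBayesClassifier

sentences, labels = load_dataset()  # hypothetical helper returning parallel lists
sentences, labels = np.array(sentences), np.array(labels)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(sentences):
    clf = NaiveBayesClassifier()
    clf.train(list(sentences[train_idx]), list(labels[train_idx]))
    predicted = clf.test(list(sentences[test_idx]))
    accuracies.append(np.mean(np.array(predicted) == labels[test_idx]))

print('Mean 10-fold accuracy: %.1f%%' % (100 * np.mean(accuracies)))
```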