This program implements a naive Bayes classifier in Python.
- Training data: 700 positive reviews and 700 negative reviews
- Testing data: 300 positive reviews and 300 negative reviews
Implement a dictionary
Convert a text document into a feature vector:
BOWDj = transfer(fileDj, vocabulary)
where fileDj is the location of file j.
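A minimal sketch of such a bag-of-words conversion follows. The tokenization (lowercasing and splitting on whitespace) and the choice to skip out-of-vocabulary tokens are assumptions made for illustration, as is the three-word vocabulary in the usage lines.

```python
import os
import tempfile


def transfer(fileDj, vocabulary):
    """Return the bag-of-words count vector of the document stored at fileDj."""
    # Map each vocabulary word to its position in the feature vector.
    index = {word: i for i, word in enumerate(vocabulary)}
    bow = [0] * len(vocabulary)
    with open(fileDj, encoding="utf-8", errors="ignore") as f:
        # Assumption: lowercase the text and split on whitespace.
        for token in f.read().lower().split():
            i = index.get(token)
            if i is not None:  # Assumption: skip out-of-vocabulary tokens.
                bow[i] += 1
    return bow


# Usage on a throwaway file with a hypothetical three-word vocabulary.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Great movie , great cast")
demo_bow = transfer(tmp.name, ["great", "bad", "movie"])
os.remove(tmp.name)
print(demo_bow)  # [2, 0, 1]
```

Building the word-to-index dictionary once per call keeps each token lookup constant-time; for a large corpus it may be worth building it once and reusing it.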
Read the training and test documents into BOW vector representations using the above function. Then store the features in the matrices Xtrain and Xtest, and use ytrain and ytest to store the labels.
Xtrain, Xtest, ytrain, ytest = loadData(textDataSetsDirectoryFullPath)
– “textDataSetsDirectoryFullPath” is the full path of the directory that you get from unzipping the data file. For instance, it is “/HW3/data sets/” on the instructor’s laptop.
– loadData should call transfer()
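One possible shape for loadData is sketched below. The directory layout (training_set/pos, training_set/neg, test_set/pos, test_set/neg), the module-level VOCABULARY list, and the +1/-1 label encoding are assumptions for illustration; the actual data set may be organized differently.

```python
import os

# Hypothetical vocabulary; the real one comes from the dictionary step.
VOCABULARY = ["love", "hate", "great", "bad"]


def transfer(fileDj, vocabulary):
    """Bag-of-words counts for one file (out-of-vocabulary tokens skipped)."""
    index = {word: i for i, word in enumerate(vocabulary)}
    bow = [0] * len(vocabulary)
    with open(fileDj, encoding="utf-8", errors="ignore") as f:
        for token in f.read().lower().split():
            if token in index:
                bow[index[token]] += 1
    return bow


def loadData(textDataSetsDirectoryFullPath):
    """Return Xtrain, Xtest, ytrain, ytest as plain Python lists."""

    def load_split(split):
        X, y = [], []
        # Assumed layout: <root>/<split>/pos and <root>/<split>/neg.
        for sub, label in (("pos", 1), ("neg", -1)):
            folder = os.path.join(textDataSetsDirectoryFullPath, split, sub)
            for name in sorted(os.listdir(folder)):
                X.append(transfer(os.path.join(folder, name), VOCABULARY))
                y.append(label)
        return X, y

    Xtrain, ytrain = load_split("training_set")
    Xtest, ytest = load_split("test_set")
    return Xtrain, Xtest, ytrain, ytest
```

Sorting the file names makes the row order reproducible across runs, which helps when comparing predictions between implementations.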
- We need to learn P(cj) and P(wi|cj) from the training set. Using MLE, we apply relative-frequency estimation with Laplace smoothing to estimate these parameters.
- Since we have the same number of positive samples and negative samples, P(c = −1) = P(c = 1) = 1/2.
thetaPos,thetaNeg = naiveBayesMulFeature_train(Xtrain,ytrain)
Note: Pay attention to the MLE estimator plus smoothing; here we choose α = 1.
Note: thetaPos and thetaNeg should be Python lists or NumPy arrays (both 1-d vectors)
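The smoothed estimate can be sketched as follows: for each class, P(wi|cj) is taken as (count of wi in class cj + α) divided by (total word count in class cj + α times the vocabulary size). The function name is written with an underscore so that it is a valid Python identifier, and the `alpha` keyword argument and the +1/-1 label encoding are assumptions added for illustration.

```python
def naiveBayesMulFeature_train(Xtrain, ytrain, alpha=1.0):
    """Estimate P(w_i | c) for c = +1 and c = -1 with Laplace smoothing."""
    n_words = len(Xtrain[0])
    # Per-class total count of every vocabulary word.
    counts = {1: [0] * n_words, -1: [0] * n_words}
    for row, label in zip(Xtrain, ytrain):
        for i, c in enumerate(row):
            counts[label][i] += c

    def smooth(word_counts):
        # Adding alpha to every word adds alpha * n_words to the denominator.
        total = sum(word_counts) + alpha * n_words
        return [(c + alpha) / total for c in word_counts]

    return smooth(counts[1]), smooth(counts[-1])


# Toy data: one positive and one negative document over three words.
thetaPos, thetaNeg = naiveBayesMulFeature_train([[2, 0, 1], [0, 3, 0]], [1, -1])
print(thetaPos)  # approximately [0.5, 0.167, 0.333]
```

Each returned vector sums to 1, and no entry is zero even for words never seen in a class, which is what makes the later log computation safe.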
yPredict,Accuracy = naiveBayesMulFeature_test(Xtest,ytest,thetaPos,thetaNeg)
- Use “sklearn.naive_bayes.MultinomialNB” from the scikit-learn package to perform training and testing. Compare the results with your MNBC. Add the resulting Accuracy to the writeup.
Important: Do not forget to work with log probabilities in the classification process; multiplying many small probabilities directly underflows.
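A sketch of log-space classification for the multinomial model follows. Because both classes have the same prior, the prior term is dropped from the score. Breaking ties in favor of the positive class and the +1/-1 label encoding are assumptions for illustration, and the toy parameter values are made up.

```python
import math


def naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg):
    """Classify BOW vectors by comparing per-class log-likelihoods."""
    logPos = [math.log(t) for t in thetaPos]
    logNeg = [math.log(t) for t in thetaNeg]
    yPredict = []
    for row in Xtest:
        # Sum of count * log P(w | c) replaces a product of tiny probabilities.
        scorePos = sum(c * lp for c, lp in zip(row, logPos))
        scoreNeg = sum(c * ln for c, ln in zip(row, logNeg))
        # Assumption: ties go to the positive class.
        yPredict.append(1 if scorePos >= scoreNeg else -1)
    Accuracy = sum(p == t for p, t in zip(yPredict, ytest)) / len(ytest)
    return yPredict, Accuracy


# Toy parameters over a three-word vocabulary.
yPredict, Accuracy = naiveBayesMulFeature_test(
    [[1, 0, 1], [0, 2, 0]], [1, -1],
    [0.5, 1 / 6, 1 / 3], [1 / 6, 4 / 6, 1 / 6],
)
print(yPredict, Accuracy)  # [1, -1] 1.0
```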
- When classifying a test sample using MNBC, it is actually not necessary to first perform the BOW transformation into feature vectors.
yPredictOne = naiveBayesMulFeature_testDirectOne(XtestTextFileNameInFullPathOne,thetaPos,thetaNeg)
- Use the above function on all the testing text files, and calculate the “classification accuracy” based on “yPredict” versus the testing labels.
yPredict, Accuracy = naiveBayesMulFeature_testDirect(testFileDirectoryFullPath, thetaPos, thetaNeg)
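Direct classification can be sketched by adding log P(w|c) once per token while reading the file, which gives the same score as the count-weighted sum over a BOW vector. The module-level VOCABULARY and WORD_INDEX, the pos/neg subfolder layout, the tokenization, and the tie-breaking rule are all assumptions for illustration; function names are written with underscores so they are valid Python identifiers.

```python
import math
import os

# Hypothetical vocabulary, in the same order used to train thetaPos/thetaNeg.
VOCABULARY = ["great", "bad", "movie"]
WORD_INDEX = {word: i for i, word in enumerate(VOCABULARY)}


def naiveBayesMulFeature_testDirectOne(path, thetaPos, thetaNeg):
    """Classify one text file without building its BOW vector first."""
    scorePos = scoreNeg = 0.0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for token in f.read().lower().split():
            i = WORD_INDEX.get(token)
            if i is not None:  # Assumption: skip out-of-vocabulary tokens.
                scorePos += math.log(thetaPos[i])
                scoreNeg += math.log(thetaNeg[i])
    return 1 if scorePos >= scoreNeg else -1


def naiveBayesMulFeature_testDirect(testFileDirectoryFullPath, thetaPos, thetaNeg):
    """Classify every file under <dir>/pos and <dir>/neg (assumed layout)."""
    yPredict, yTrue = [], []
    for sub, label in (("pos", 1), ("neg", -1)):
        folder = os.path.join(testFileDirectoryFullPath, sub)
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            yPredict.append(
                naiveBayesMulFeature_testDirectOne(path, thetaPos, thetaNeg)
            )
            yTrue.append(label)
    Accuracy = sum(p == t for p, t in zip(yPredict, yTrue)) / len(yTrue)
    return yPredict, Accuracy
```

Skipping the vector construction avoids allocating a vocabulary-sized list per document, which matters most when the vocabulary is much larger than a typical review.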
- We need to learn P(cj), P(wi = false|cj) and P(wi = true|cj) from the training set. MLE gives the relative frequency as the estimate of each parameter. We add Laplace smoothing when estimating these parameters.
thetaPosTrue,thetaNegTrue = naiveBayesBernFeature_train(Xtrain,ytrain)
yPredict,Accuracy = naiveBayesBernFeature_test(Xtest,ytest,thetaPosTrue,thetaNegTrue)
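A sketch of the Bernoulli variant follows. Here P(wi = true|cj) is estimated as (number of class-cj documents containing wi + α) divided by (number of class-cj documents + 2α), and P(wi = false|cj) is its complement. The `alpha` keyword argument, the +1/-1 labels, the tie-breaking rule, and the omission of the (equal) class priors are assumptions for illustration.

```python
import math


def naiveBayesBernFeature_train(Xtrain, ytrain, alpha=1.0):
    """Estimate P(w_i = true | c) for c = +1 and c = -1."""
    n_words = len(Xtrain[0])
    docs_with_word = {1: [0] * n_words, -1: [0] * n_words}
    n_docs = {1: 0, -1: 0}
    for row, label in zip(Xtrain, ytrain):
        n_docs[label] += 1
        for i, c in enumerate(row):
            if c > 0:  # Only presence matters, not the count.
                docs_with_word[label][i] += 1

    def smooth(label):
        # Two outcomes (true/false) per word, hence 2 * alpha.
        denom = n_docs[label] + 2 * alpha
        return [(k + alpha) / denom for k in docs_with_word[label]]

    return smooth(1), smooth(-1)


def naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue):
    """Classify using log P(true) for present words, log P(false) otherwise."""

    def log_likelihood(row, thetaTrue):
        return sum(
            math.log(t) if c > 0 else math.log(1.0 - t)
            for c, t in zip(row, thetaTrue)
        )

    yPredict = [
        1 if log_likelihood(r, thetaPosTrue) >= log_likelihood(r, thetaNegTrue) else -1
        for r in Xtest
    ]
    Accuracy = sum(p == t for p, t in zip(yPredict, ytest)) / len(ytest)
    return yPredict, Accuracy


thetaPosTrue, thetaNegTrue = naiveBayesBernFeature_train(
    [[2, 0], [1, 0], [0, 3], [0, 1]], [1, 1, -1, -1]
)
yPredict, Accuracy = naiveBayesBernFeature_test(
    [[1, 0], [0, 2]], [1, -1], thetaPosTrue, thetaNegTrue
)
print(thetaPosTrue, yPredict, Accuracy)  # [0.75, 0.25] [1, -1] 1.0
```

Unlike the multinomial model, absent words contribute to the score here, so every vocabulary entry is visited for every document.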
Not surprisingly, the algorithm that takes continuous features into consideration is more effective than the one that does not. Our original algorithm achieves an average accuracy of 0.675.
Overall, this project demonstrated that the Naive Bayes algorithm is easy to implement and gives fairly reliable results.