Content Description
- execute work.py using python3.x
- folder imdb contains npy's
- Report.pdf is the work report
Program Flow
- import necessary modules
- load npy's from /imdb
- call freqdist(): calculate frequency distribution of ids in x_train and x_test
- call topk(): obtain frequency of top-K ids in x_train and x_test
- call GaussianNaiveBayes(): fit gnb model & calculate accuracy, precision, recall
- repeat the topk() and GaussianNaiveBayes() steps (steps 4 ~ 5) for k = 100, 1000, 10000
Program Output
Calculating freqdist of x_train & x_test...done.
~~~~~~~~~~ K = 100 ~~~~~~~~~~
Obtaining frequency of top-100 words in x_train...done.
Obtaining frequency of top-100 words in x_test...done.
Training gnb model...done.
Accuracy = 0.69168, Precision = 0.70542, Recall = 0.65824
~~~~~~~~~~ K = 1000 ~~~~~~~~~~
Obtaining frequency of top-1000 words in x_train...done.
Obtaining frequency of top-1000 words in x_test...done.
Training gnb model...done.
Accuracy = 0.81004, Precision = 0.82396, Recall = 0.78856
~~~~~~~~~~ K = 10000 ~~~~~~~~~~
Obtaining frequency of top-10000 words in x_train...done.
Obtaining frequency of top-10000 words in x_test...done.
Training gnb model...done.
Accuracy = 0.66128, Precision = 0.76809, Recall = 0.46208
Press any key to exit.
work.py
- written in py3
- imports: sklearn, nltk, numpy
- trains a gaussian nb model using top-k most frequent words (k = 100, 1000, 10000)
- calculates accuracy, precision, recall of each model
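For reference, the three reported metrics can be computed from the confusion counts as sketched below. This is a minimal plain-Python illustration, assuming binary 0/1 labels with 1 as the positive class; the helper name `binary_metrics` is hypothetical and is not taken from work.py, which may well use sklearn's metric functions instead.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1          # predicted positive, actually positive
        elif pred == 1 and truth == 0:
            fp += 1          # predicted positive, actually negative
        elif pred == 0 and truth == 0:
            tn += 1          # predicted negative, actually negative
        else:
            fn += 1          # predicted negative, actually positive
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels (made up for illustration, not from the imdb data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"Accuracy = {accuracy:.5f}, Precision = {precision:.5f}, Recall = {recall:.5f}")
```

Read this way, the K = 10000 line in Program Output (precision 0.76809, recall 0.46208) means the model's positive predictions are mostly right, but it misses more than half of the positive-class samples.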
Functions in work.py
freqdist()
input: list of word-id lists, one per sample
output: list of id frequency distributions, one per sample
- calculates the frequency distribution of ids in each sample
- also sorts each id list in ascending order
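The behaviour described for freqdist() can be sketched roughly as follows. This is an illustration only: it uses `collections.Counter` from the standard library rather than nltk (which work.py imports and may use here), the name `freqdist_sketch` is hypothetical, and "sorted in ascending order" is interpreted as ordering each distribution by id.

```python
from collections import Counter


def freqdist_sketch(samples):
    """Return one {word_id: count} dict per sample, keyed in ascending id order."""
    dists = []
    for ids in samples:
        counts = Counter(ids)  # number of occurrences of each id in this sample
        # Rebuild the mapping so that iteration follows ascending id order.
        dists.append({word_id: counts[word_id] for word_id in sorted(counts)})
    return dists


# Two toy samples (made-up ids, not from the imdb data).
samples = [[4, 1, 4, 2], [7, 7, 7]]
print(freqdist_sketch(samples))  # [{1: 1, 2: 1, 4: 2}, {7: 3}]
```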
topk()
input: k, list of id frequency distributions (one per sample)
output: numpy array of top-K id frequencies for each sample
- obtains the frequency of the top-K ids in each sample
- i.e. number of times each word id appeared in each sample
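A rough sketch of how such a feature matrix could be built is shown below. It rests on an assumption the README does not state: that word ids are frequency ranks, so ids 1 to K are the K most frequent words. The name `topk_sketch` is hypothetical, and the sketch returns nested lists to stay dependency-free, whereas topk() in work.py returns a numpy array.

```python
def topk_sketch(k, dists):
    """Build a len(dists) x k count matrix from per-sample {id: count} dicts.

    Assumption: ids 1..k are the k most frequent words, so column j holds
    the count of id j + 1. Ids outside that range are ignored.
    """
    features = []
    for dist in dists:
        row = [0] * k
        for word_id, count in dist.items():
            if 1 <= word_id <= k:
                row[word_id - 1] = count
        features.append(row)
    return features


# Toy frequency distributions (made up for illustration).
dists = [{1: 2, 3: 1, 50: 4}, {2: 5}]
print(topk_sketch(3, dists))  # [[2, 0, 1], [0, 5, 0]]
```

Every sample ends up with a fixed-length vector of K counts, which is the shape a Gaussian naive Bayes classifier expects regardless of how long the original review was.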
GaussianNaiveBayes()
input: none
output: accuracy, precision, recall
- fits a gnb model using frequency of top-K words
- calculates the accuracy, precision, recall of the model
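To illustrate what fitting a gnb model on count features involves, here is a minimal plain-Python Gaussian naive Bayes. It is a swapped-in teaching sketch, not the code in work.py, which imports sklearn and presumably relies on `sklearn.naive_bayes.GaussianNB`. The names `fit_gnb` and `predict_gnb`, the constant `eps`, and the toy data are all hypothetical.

```python
import math


def fit_gnb(X, y, eps=1e-9):
    """Estimate, per class, the prior and each feature's mean and variance."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # eps keeps the variance non-zero for features that never vary.
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gnb(model, X):
    """Pick the class with the highest log prior plus Gaussian log-likelihood."""
    predictions = []
    for x in X:
        scores = {}
        for label, (prior, means, variances) in model.items():
            log_likelihood = sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances)
            )
            scores[label] = math.log(prior) + log_likelihood
        predictions.append(max(scores, key=scores.get))
    return predictions


# Toy count features: class 1 uses word 0 heavily, class 0 uses word 1.
X_train = [[5, 0], [4, 1], [6, 1], [0, 5], [1, 4], [1, 6]]
y_train = [1, 1, 1, 0, 0, 0]
model = fit_gnb(X_train, y_train)
print(predict_gnb(model, [[5, 1], [0, 4]]))  # [1, 0]
```

Each feature is modelled as an independent Gaussian per class, which is only a loose fit for sparse word counts; that may be one reason the reported accuracy drops again at K = 10000, although the README itself does not give an explanation.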