Content Description

run work.py with Python 3.x
the imdb folder contains the .npy data files
Report.pdf is the project report

Program Flow

1. import necessary modules
2. load the .npy files from /imdb
3. call freqdist()
   calculate the frequency distribution of ids in x_train and x_test
4. call topk()
   obtain the frequency of the top-K ids in x_train and x_test
5. call GaussianNaiveBayes()
   fit the GNB model and calculate accuracy, precision, recall
6. repeat steps 4 and 5 for k = 100, 1000, 10000
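
The loading step could look roughly like the sketch below. The helper `load_split` and the file name `x_train.npy` are hypothetical (the README only says that the imdb folder holds .npy files), and the sketch assumes each file stores an object array with one variable-length id list per review, which is why `allow_pickle=True` is passed.

```python
import os
import tempfile

import numpy as np


def load_split(folder, name):
    """Load <folder>/<name>.npy (hypothetical naming, not taken from work.py)."""
    # np.load refuses object arrays unless allow_pickle=True is given.
    return np.load(os.path.join(folder, name + ".npy"), allow_pickle=True)


# Round trip in a temporary folder so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as folder:
    demo = np.empty(2, dtype=object)
    demo[0] = [1, 14, 22]  # word ids of a first, made-up review
    demo[1] = [1, 5]       # a shorter, made-up review
    np.save(os.path.join(folder, "x_train.npy"), demo)
    x_train = load_split(folder, "x_train")

print(len(x_train), x_train[0])
```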

Program Output

Calculating freqdist of x_train & x_test...done.
~~~~~~~~~~  K = 100  ~~~~~~~~~~
Obtaining frequency of top-100 words in x_train...done.
Obtaining frequency of top-100 words in x_test...done.
Training gnb model...done.
Accuracy = 0.69168, Precision = 0.70542, Recall = 0.65824

~~~~~~~~~~  K = 1000  ~~~~~~~~~~
Obtaining frequency of top-1000 words in x_train...done.
Obtaining frequency of top-1000 words in x_test...done.
Training gnb model...done.
Accuracy = 0.81004, Precision = 0.82396, Recall = 0.78856

~~~~~~~~~~  K = 10000  ~~~~~~~~~~
Obtaining frequency of top-10000 words in x_train...done.
Obtaining frequency of top-10000 words in x_test...done.
Training gnb model...done.
Accuracy = 0.66128, Precision = 0.76809, Recall = 0.46208

Press any key to exit.

`work.py`

written in Python 3
imports: sklearn, nltk, numpy
trains a Gaussian naive Bayes model using the top-K most frequent words (k = 100, 1000, 10000)
calculates accuracy, precision, recall of each model
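
For reference, the three reported metrics can be derived from prediction counts as in the sketch below. `binary_metrics` is a hypothetical helper, not a function in work.py, and treating label 1 as the positive class is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here 3 of 5 predictions are correct, and 2 of the 3 predicted positives and 2 of the 3 actual positives are true positives.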

functions in `work.py`

`freqdist()`

input: list of word-id lists, one per sample
output: list of id frequency distributions, one per sample

calculates frequency distribution of id in each sample
also sorts each id list in ascending order
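
A minimal sketch of this behaviour, using `collections.Counter` as a stand-in (work.py imports nltk, so the actual implementation may differ). Representing each distribution as a dict keyed by id in ascending order is an assumption.

```python
from collections import Counter


def freqdist(samples):
    """Return one {word_id: count} dict per sample, keys in ascending id order."""
    return [dict(sorted(Counter(sample).items())) for sample in samples]


samples = [[5, 2, 5, 9], [1, 1, 3]]
dists = freqdist(samples)
print(dists)  # [{2: 1, 5: 2, 9: 1}, {1: 2, 3: 1}]
```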

`topk()`

input: k, list of id frequency distributions, one per sample
output: numpy array holding the frequency of each top-K id, one row per sample

obtains the frequency of the top-K ids in each sample,
i.e. the number of times each of the top-K word ids appears in that sample
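
One way to build such an array is sketched below. It assumes that word ids are ranked by overall frequency, so that the top-K words are simply the ids below k; the README does not state how the top-K ids are chosen, so this is only an illustration. The input format (one `{word_id: count}` dict per sample) is also an assumption.

```python
import numpy as np


def topk(k, dists):
    """Return a (num_samples, k) array; column j counts word id j in each sample."""
    features = np.zeros((len(dists), k), dtype=np.int64)
    for row, dist in enumerate(dists):
        for word_id, count in dist.items():
            # Assumption: ids 0..k-1 are the k most frequent words.
            if 0 <= word_id < k:
                features[row, word_id] = count
    return features


dists = [{2: 1, 5: 2, 9: 1}, {1: 2, 3: 1}]
features = topk(6, dists)
print(features)
# [[0 0 1 0 0 2]
#  [0 2 0 1 0 0]]
```

Id 9 in the first sample is dropped because it falls outside the top 6.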

`GaussianNaiveBayes()`

input: none
output: accuracy, precision, recall

fits a Gaussian naive Bayes (GNB) model on the frequencies of the top-K words
calculates the accuracy, precision, and recall of the model on the test set
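
The fit-and-score step can be sketched with scikit-learn as follows. Unlike the function described above, which takes no input, this hypothetical `gaussian_naive_bayes` receives the data as arguments so that it is self-contained; the tiny arrays are made up for illustration and are not the IMDB data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB


def gaussian_naive_bayes(x_train, y_train, x_test, y_test):
    """Fit GaussianNB on the training features and score it on the test features."""
    model = GaussianNB()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
    )


# Made-up, well-separated word-count features (2 columns = 2 word ids).
x_train = np.array([[5, 0], [4, 1], [0, 5], [1, 4]])
y_train = np.array([1, 1, 0, 0])
x_test = np.array([[5, 1], [0, 4]])
y_test = np.array([1, 0])

accuracy, precision, recall = gaussian_naive_bayes(x_train, y_train, x_test, y_test)
print("Accuracy = %.5f, Precision = %.5f, Recall = %.5f" % (accuracy, precision, recall))
```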

alexxss/mlhw1q2
