mlhw1q2

ML homework 1, question 2

Content Description

  • Run work.py with Python 3.x.
  • The imdb folder contains the .npy data files.
  • Report.pdf is the work report.

Program Flow

  1. Import the necessary modules.
  2. Load the .npy files from the imdb folder.
  3. Call freqdist():
    calculate the frequency distribution of word ids in x_train and x_test.
  4. Call topk():
    obtain the frequency of the top-K word ids in x_train and x_test.
  5. Call GaussianNaiveBayes():
    fit the GNB model and calculate accuracy, precision, and recall.
  6. Repeat steps 4 and 5 for K = 100, 1000, 10000.
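
Step 2 can be sketched as follows. This is a minimal illustration, not the code in work.py: the helper load_split and the file name x_train.npy are hypothetical, and the toy array only stands in for the real files. It assumes each sample is a variable-length list of word ids, in which case NumPy stores an object array and np.load needs allow_pickle=True.

```python
import os
import tempfile

import numpy as np


def load_split(folder, name):
    """Load one .npy file (hypothetical helper, not part of work.py).

    allow_pickle=True is required when the array holds Python lists
    of different lengths, i.e. an object array.
    """
    return np.load(os.path.join(folder, name), allow_pickle=True)


# Round-trip demo with toy data standing in for the real files in imdb/.
with tempfile.TemporaryDirectory() as folder:
    toy = np.empty(2, dtype=object)
    toy[0] = [1, 14, 22, 14]   # word ids of sample 0
    toy[1] = [1, 5]            # word ids of sample 1
    np.save(os.path.join(folder, "x_train.npy"), toy)  # assumed file name
    x_train = load_split(folder, "x_train.npy")

print(len(x_train), x_train[0])
```

Only load files you trust with allow_pickle=True, since unpickling can execute arbitrary code.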

Program Output

Calculating freqdist of x_train & x_test...done.
~~~~~~~~~~  K = 100  ~~~~~~~~~~
Obtaining frequency of top-100 words in x_train...done.
Obtaining frequency of top-100 words in x_test...done.
Training gnb model...done.
Accuracy = 0.69168, Precision = 0.70542, Recall = 0.65824

~~~~~~~~~~  K = 1000  ~~~~~~~~~~
Obtaining frequency of top-1000 words in x_train...done.
Obtaining frequency of top-1000 words in x_test...done.
Training gnb model...done.
Accuracy = 0.81004, Precision = 0.82396, Recall = 0.78856

~~~~~~~~~~  K = 10000  ~~~~~~~~~~
Obtaining frequency of top-10000 words in x_train...done.
Obtaining frequency of top-10000 words in x_test...done.
Training gnb model...done.
Accuracy = 0.66128, Precision = 0.76809, Recall = 0.46208

Press any key to exit.

work.py

  • Written in Python 3.
  • Imports: sklearn, nltk, numpy.
  • Trains a Gaussian naive Bayes model using the top-K most frequent words (K = 100, 1000, 10000).
  • Calculates the accuracy, precision, and recall of each model.
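
For reference, the three reported metrics are simple ratios of confusion-matrix counts. The sketch below is a plain-Python illustration, not code from work.py; the helper name binary_metrics and the choice of label 1 as the positive class are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall), treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


acc, prec, rec = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(acc, prec, rec)
```

In the toy call there are 2 true positives, 1 true negative, 1 false positive, and 1 false negative, so accuracy is 3/5 and both precision and recall are 2/3.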

Functions in work.py

freqdist()

input: a list of word-id lists, one per sample
output: a list of word-id frequency distributions, one per sample

  • calculates the frequency distribution of word ids in each sample
  • also sorts the ids of each sample in ascending order
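
A minimal sketch of this step, under stated assumptions: work.py imports nltk and may use its frequency-distribution class, but the sketch swaps in collections.Counter from the standard library to stay dependency-free. The name freqdist_sketch and the dict-per-sample return type are illustrative, not taken from work.py.

```python
from collections import Counter


def freqdist_sketch(samples):
    """Return one {word_id: count} dict per sample, keys in ascending id order."""
    return [dict(sorted(Counter(sample).items())) for sample in samples]


dists = freqdist_sketch([[1, 14, 22, 14], [5, 1, 5]])
print(dists)  # [{1: 1, 14: 2, 22: 1}, {1: 1, 5: 2}]
```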

topk()

input: K, and a list of word-id frequency distributions, one per sample
output: a numpy array holding the top-K word-id frequencies for each sample

  • obtains the frequency of the top-K word ids in each sample,
  • i.e. the number of times each of those word ids appears in the sample
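
One way this could look is sketched below. It assumes the word ids are ranked by corpus frequency, so that ids below K stand in for the top-K words; if work.py selects the top-K ids differently, only the id-to-column mapping changes. The name topk_sketch and the dict-based input are illustrative assumptions.

```python
import numpy as np


def topk_sketch(k, dists):
    """Build an (n_samples, k) count matrix; column j holds the count of id j.

    Assumes ids 0..k-1 are the k most frequent words (hypothetical mapping).
    """
    features = np.zeros((len(dists), k), dtype=np.int64)
    for row, dist in enumerate(dists):
        for word_id, count in dist.items():
            if 0 <= word_id < k:  # ids outside the top-k range are dropped
                features[row, word_id] = count
    return features


x = topk_sketch(4, [{1: 1, 2: 2, 9: 1}, {0: 3, 3: 1}])
print(x)
```

Each row is a fixed-length feature vector, which is the shape a scikit-learn estimator expects regardless of how long the original review was.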

GaussianNaiveBayes()

input: none
output: accuracy, precision, recall

  • fits a Gaussian naive Bayes (GNB) model using the frequencies of the top-K words
  • calculates the accuracy, precision, and recall of the model
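
The fit-and-score step can be sketched with scikit-learn, which work.py imports. This is an illustration under assumptions rather than the actual function: the documented GaussianNaiveBayes() takes no input, whereas the hypothetical gnb_sketch below takes the arrays as arguments so the example is self-contained, and the toy data is made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB


def gnb_sketch(x_train, y_train, x_test, y_test):
    """Fit GaussianNB on the training counts and score it on the test counts."""
    model = GaussianNB().fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
    )


# Toy count features: class 0 uses word 0 heavily, class 1 uses word 1.
x_train = np.array([[5, 0], [4, 1], [6, 0], [0, 5], [1, 4], [0, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_test = np.array([[5, 1], [1, 5]])
y_test = np.array([0, 1])

acc, prec, rec = gnb_sketch(x_train, y_train, x_test, y_test)
print("Accuracy = %.5f, Precision = %.5f, Recall = %.5f" % (acc, prec, rec))
```

In the real pipeline the feature arrays would come from topk() for each K, and the same call would be repeated for K = 100, 1000, 10000.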