- `nltk`: provides tools for preprocessing (e.g. stop-word removal, normalization)
  - `WordNetLemmatizer`: from `nltk.stem`, used for word-form normalization
  - `stopwords`: from `nltk.corpus`, used for removing stop words
- `numpy`, `math`, `string`, `csv`, `argparse`: `numpy` and the standard-library modules are imported for loading data (`csv`, `argparse`), math calculation (`numpy`, `math`) and string processing (`string`)
- `seaborn`, `matplotlib`: used for drawing the confusion matrix
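The preprocessing role of `nltk` described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `preprocess` and `build_nltk_preprocessor` are hypothetical, and the exact order of steps (lowercasing, punctuation stripping, stop-word removal, lemmatization) is an assumption.

```python
import string


def preprocess(text, stop_words, lemmatize=lambda word: word):
    """Lowercase, strip punctuation, drop stop words, then normalize each token."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [lemmatize(tok) for tok in tokens if tok not in stop_words]


def build_nltk_preprocessor():
    """Bind preprocess() to nltk's English stop words and WordNetLemmatizer.

    Hypothetical helper; needs nltk plus the 'wordnet' and 'stopwords' datasets.
    """
    from functools import partial
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    return partial(
        preprocess,
        stop_words=set(stopwords.words("english")),
        lemmatize=WordNetLemmatizer().lemmatize,
    )
```

Keeping the stop-word set and the lemmatizer as parameters lets the same function run with a small hand-written stop list when the nltk datasets are not installed.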
- TF-IDF: The project calculates the TF-IDF value of each word as that word's feature and derives a probability from it in a way similar to the Bayes formula.
- Combination: Through testing on the dev data set, we found that multiplying the TF-IDF-based probability by the Naive Bayes probability gives the most accurate predictions. We therefore use this product as our core feature extraction algorithm.
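One way the combination described above might look is sketched below. The README does not give the exact formula, so everything here is an assumption: the Laplace smoothing, the smoothed IDF variant, the way TF-IDF weights are turned into a per-class probability, and the names `train` and `predict` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """docs: list of token lists; labels: parallel list of class ids."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so every seen word keeps a positive weight.
    idf = {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}

    counts = defaultdict(Counter)   # class -> raw term counts (Naive Bayes)
    weights = defaultdict(Counter)  # class -> summed TF-IDF weights
    for doc, label in zip(docs, labels):
        for tok, tf in Counter(doc).items():
            counts[label][tok] += tf
            weights[label][tok] += tf / len(doc) * idf[tok]
    priors = {c: n / n_docs for c, n in Counter(labels).items()}
    return {"vocab": set(idf), "counts": counts, "weights": weights, "priors": priors}


def _log_likelihood(table, vocab, tokens):
    """Sum of Laplace-smoothed log P(token | class) over tokens seen in training."""
    total = sum(table.values()) + len(vocab)
    return sum(math.log((table[tok] + 1.0) / total) for tok in tokens if tok in vocab)


def predict(model, tokens):
    """Return the class with the highest combined score."""
    scores = {}
    for c, prior in model["priors"].items():
        nb = math.log(prior) + _log_likelihood(model["counts"][c], model["vocab"], tokens)
        tfidf = _log_likelihood(model["weights"][c], model["vocab"], tokens)
        scores[c] = nb + tfidf  # adding logs is multiplying the two probabilities
    return max(scores, key=scores.get)
```

Working in log space avoids underflow on long reviews while preserving the ranking that the product of the two probabilities would give.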
python3 NB_sentiment_analyser.py moviereviews/train.tsv moviereviews/dev.tsv moviereviews/test.tsv -classes ... -features ... -confusion_matrix -output_files
- `-classes`: `3` or `5`
- `-features`: `features` or `all_words`
- `-confusion_matrix`, `-output_files`: optional flags (default `False` when omitted)
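For reference, the command-line interface above could be declared with `argparse` along these lines. This is a sketch under assumptions: the positional argument names and the `build_parser` helper are hypothetical, and marking `-classes` and `-features` as required is a guess, since no default is documented for them.

```python
import argparse


def build_parser():
    """Build a parser matching the documented command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Naive Bayes sentiment analyser")
    # Positional names are assumptions; only their order comes from the usage line.
    parser.add_argument("training", help="path to the training .tsv file")
    parser.add_argument("dev", help="path to the dev .tsv file")
    parser.add_argument("test", help="path to the test .tsv file")
    # Assumed required, because the README states no default for these two.
    parser.add_argument("-classes", type=int, choices=[3, 5], required=True)
    parser.add_argument("-features", choices=["features", "all_words"], required=True)
    # store_true flags default to False when omitted, as documented.
    parser.add_argument("-confusion_matrix", action="store_true")
    parser.add_argument("-output_files", action="store_true")
    return parser
```

Calling `build_parser().parse_args()` in the script's entry point would then expose the values as `args.classes`, `args.features`, `args.confusion_matrix` and `args.output_files`.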
- `pip install seaborn`: install for plotting the confusion matrix
- `pip install matplotlib`: install for plotting
- `pip install nltk`: install the preprocessing library
- `nltk.download('wordnet')`: download the dataset for `WordNetLemmatizer`
- `nltk.download('stopwords')`: download the dataset for `stopwords`
- “Improved Bayes Method Based on TF-IDF Feature and Grade Factor Feature for Chinese Information Classification,” IEEE Conference Publication, IEEE Xplore. https://ieeexplore.ieee.org/abstract/document/8367204 (accessed Dec. 15, 2023).