https://github.com/DPigeon/NLP-Language-Classifier
A Naive Bayes classifier for NLP that determines the most likely language of a tweet.
First, install Miniconda with Python 3.7 from
https://docs.conda.io/en/latest/miniconda.html
You also need NumPy to run the project.
Install NumPy with
conda install numpy
To run the program, first create an output folder in the root of the project. Then edit the input.txt file in the input folder. The input file has the following format:
vocabulary size_of_ngram smoothing_value training_file testing_file
where vocabulary is one of:
0: Fold the corpus to lowercase and use only the 26 letters of the alphabet [a-z]
1: Distinguish upper and lower case and use only the 26 letters of the alphabet in both cases [a-z, A-Z]
2: Distinguish upper and lower case and use all characters accepted by the built-in isalpha() method
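The three vocabulary options can be sketched as a character filter. This is a minimal illustration, not the project's own code: the names in_vocabulary and normalize are hypothetical, and it assumes the filter is applied one character at a time.

```python
import string


def normalize(text: str, vocabulary: int) -> str:
    """Fold the text to lowercase only for vocabulary 0 (assumed behavior)."""
    return text.lower() if vocabulary == 0 else text


def in_vocabulary(ch: str, vocabulary: int) -> bool:
    """Return True if the single character ch belongs to the chosen vocabulary."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if vocabulary == 0:
        return ch in string.ascii_lowercase   # [a-z]
    if vocabulary == 1:
        return ch in string.ascii_letters     # [a-z, A-Z]
    if vocabulary == 2:
        return ch.isalpha()                   # any character str.isalpha() accepts
    raise ValueError("vocabulary must be 0, 1 or 2")
```

Note that under option 2 an accented letter such as "é" is kept, while options 0 and 1 reject it.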
where size_of_ngram is one of:
1: character unigrams
2: character bigrams
3: character trigrams
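As a rough sketch of what character n-grams look like, the helper below slides a window of n characters over a string. The function name char_ngrams is hypothetical, and skipping any window that contains an out-of-vocabulary character is an assumption, not something this README specifies.

```python
from typing import Callable, Iterator


def char_ngrams(text: str, n: int, is_valid: Callable[[str], bool]) -> Iterator[str]:
    """Yield each run of n consecutive characters in which every character is valid."""
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Assumption: a window touching an invalid character is dropped entirely.
        if all(is_valid(c) for c in gram):
            yield gram


# Bigrams of "ab cd" using str.isalpha as the validity check.
print(list(char_ngrams("ab cd", 2, str.isalpha)))  # ['ab', 'cd']
```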
The smoothing value is a number in the interval [0, 1].
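To make the input format concrete, here is a minimal sketch of reading and validating one such line. RunConfig and parse_input_line are hypothetical names, the file names in the example are placeholders, and the sketch assumes the five fields are separated by whitespace with no spaces inside the file names.

```python
from typing import NamedTuple


class RunConfig(NamedTuple):
    vocabulary: int
    ngram_size: int
    smoothing: float
    training_file: str
    testing_file: str


def parse_input_line(line: str) -> RunConfig:
    """Split one input line into its five fields and check the allowed ranges."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError("expected 5 fields, got %d" % len(parts))
    vocabulary, ngram_size, smoothing = int(parts[0]), int(parts[1]), float(parts[2])
    if vocabulary not in (0, 1, 2):
        raise ValueError("vocabulary must be 0, 1 or 2")
    if ngram_size not in (1, 2, 3):
        raise ValueError("size_of_ngram must be 1, 2 or 3")
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing_value must be in [0, 1]")
    return RunConfig(vocabulary, ngram_size, smoothing, parts[3], parts[4])


# Placeholder file names, for illustration only.
config = parse_input_line("0 2 0.5 training.txt test.txt")
```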
Each line of the trace file has the following format:
tweet_id most_likely_class score_most_likely_class correct_class correct_wrong_label
where correct_wrong_label is either correct or wrong, depending on whether most_likely_class matches correct_class.
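A trace line could be assembled along these lines. This is only a sketch: trace_line is a hypothetical helper, and the single-space separator, the unformatted score, and the literal labels "correct" and "wrong" are all assumptions about the output.

```python
def trace_line(tweet_id: str, predicted: str, score: float, actual: str) -> str:
    """Build one trace line; the label says whether the prediction matched."""
    label = "correct" if predicted == actual else "wrong"
    # Assumption: fields are joined by single spaces.
    return f"{tweet_id} {predicted} {score} {actual} {label}"
```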
The evaluation file has the following format:
accuracy
eu_precision ca_precision gl_precision es_precision en_precision pt_precision
eu_recall ca_recall gl_recall es_recall en_recall pt_recall
eu_f1_measure ca_f1_measure gl_f1_measure es_f1_measure en_f1_measure pt_f1_measure
macro_f1 weighted_average_f1
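The metrics listed above can be computed from the predicted and correct classes with the standard definitions. The sketch below is an illustration, not the project's code: evaluate is a hypothetical name, a ratio with a zero denominator is treated as 0.0, and macro F1 is averaged over all six classes even if some never occur (both are assumptions).

```python
from collections import Counter

CLASSES = ["eu", "ca", "gl", "es", "en", "pt"]


def evaluate(predicted, actual, classes=CLASSES):
    """Return accuracy, per-class precision/recall/F1, macro F1 and weighted F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, a in zip(predicted, actual):
        if p == a:
            tp[p] += 1
        else:
            fp[p] += 1   # p was predicted but is not the correct class
            fn[a] += 1   # a was the correct class but was missed

    def ratio(num, den):
        return num / den if den else 0.0

    precision = {c: ratio(tp[c], tp[c] + fp[c]) for c in classes}
    recall = {c: ratio(tp[c], tp[c] + fn[c]) for c in classes}
    f1 = {c: ratio(2 * precision[c] * recall[c], precision[c] + recall[c])
          for c in classes}
    support = Counter(actual)   # number of test tweets per correct class
    total = len(actual)
    return {
        "accuracy": ratio(sum(tp.values()), total),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_f1": sum(f1.values()) / len(classes),
        "weighted_average_f1": ratio(sum(f1[c] * support[c] for c in classes), total),
    }
```

The weighted average F1 weights each class's F1 by the number of test tweets whose correct class is that language, so frequent languages count for more than in macro F1.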