Semi-Supervised Text/Document Classification using Complementary Naive Bayes

Classifier

Automated document classifier using the Complementary Naive Bayes algorithm.

Trainer.java takes the unprocessed dataset and converts it into a processed dataset in the file format Mahout expects. It then trains the Complementary Naive Bayes algorithm and builds a statistical model.
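To make the training step more concrete, here is a minimal, self-contained sketch of Complement Naive Bayes in plain Java. It is not the code in Trainer.java and does not call Mahout. The class name `CnbSketch` and its methods are hypothetical, tokens are plain words rather than n-grams, and the sketch omits the weight normalization and term-frequency transforms that fuller implementations may apply. Returning the default category when no token is known is also an assumption about how `DefaultCategory` is meant to be used.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CnbSketch {
    private final double alpha; // smoothing parameter, as in Alpha=1.0
    private final Map<String, Map<String, Integer>> countsByLabel = new HashMap<>();
    private final Map<String, Integer> totalByLabel = new HashMap<>();
    private final Map<String, Integer> countsAll = new HashMap<>();
    private int totalAll = 0;

    public CnbSketch(double alpha) { this.alpha = alpha; }

    /** Adds one training document's token counts to its label. */
    public void train(String label, String[] tokens) {
        Map<String, Integer> counts =
            countsByLabel.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            countsAll.merge(t, 1, Integer::sum);
        }
        totalByLabel.merge(label, tokens.length, Integer::sum);
        totalAll += tokens.length;
    }

    /** Log of the smoothed frequency of token in every class except label. */
    double complementWeight(String label, String token) {
        int inLabel = countsByLabel.get(label).getOrDefault(token, 0);
        int inOthers = countsAll.getOrDefault(token, 0) - inLabel;
        int totalOthers = totalAll - totalByLabel.get(label);
        int vocabulary = countsAll.size();
        return Math.log((inOthers + alpha) / (totalOthers + alpha * vocabulary));
    }

    /** Picks the label whose complement explains the tokens worst. */
    public String classify(String[] tokens, String defaultCategory) {
        // Assumption: fall back to the default when nothing is recognised.
        if (Arrays.stream(tokens).noneMatch(countsAll::containsKey)) {
            return defaultCategory;
        }
        String best = defaultCategory;
        double bestScore = Double.POSITIVE_INFINITY;
        for (String label : countsByLabel.keySet()) {
            double score = 0.0;
            for (String t : tokens) {
                if (countsAll.containsKey(t)) score += complementWeight(label, t);
            }
            if (score < bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        CnbSketch model = new CnbSketch(1.0);
        model.train("pos", new String[] {"great", "fun", "great"});
        model.train("neg", new String[] {"boring", "bad", "boring"});
        System.out.println(model.classify(new String[] {"great"}, "unknown"));
    }
}
```

The key difference from standard Naive Bayes is that each class is scored using counts from all the other classes, and the class with the lowest complement score wins.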

Classifier.java takes a directory of unclassified documents and classifies them. It creates a separate subdirectory for each category and writes each document into the subdirectory for its assigned category.
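The output layout described above could be produced along these lines. This is a sketch under assumptions, not the code in Classifier.java: `RoutingSketch`, `route`, and the `labeler` function are hypothetical names, the labeler stands in for the trained model, and files are copied rather than moved.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;
import java.util.stream.Stream;

public class RoutingSketch {
    /**
     * Copies each regular file in inputDir to outputDir/<category>/<file name>.
     * labeler is a stand-in for the trained model: it maps text to a category.
     */
    public static void route(Path inputDir, Path outputDir,
                             Function<String, String> labeler) throws IOException {
        try (Stream<Path> files = Files.list(inputDir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(file)) continue;
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                String category = labeler.apply(text);
                // Create the category subdirectory on first use.
                Path target = outputDir.resolve(category);
                Files.createDirectories(target);
                Files.copy(file, target.resolve(file.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempDirectory("unclassified");
        Path out = Files.createTempDirectory("classified");
        Files.write(in.resolve("a.txt"), "great film".getBytes(StandardCharsets.UTF_8));
        // Toy rule in place of a real model; "unknown" mirrors DefaultCategory.
        route(in, out, t -> t.contains("great") ? "pos" : "unknown");
        System.out.println(Files.exists(out.resolve("pos").resolve("a.txt")));
    }
}
```

Passing the labeler as a function keeps the directory handling separate from the model, so the same routine works with any classifier that returns a category name.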

Setting Up Parameters in the settings.properties file

Bayes parameters

    Gramsize=2 // N-gram size
    Algorithm=cbayes // Our classification algorithm
    DefaultCategory=unknown // Default category
    DataSource=hdfs // Hadoop Distributed File System
    Encoding=UTF-8 // Unicode
    Alpha=1.0 // Smoothing parameter
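As an illustration of how the Bayes parameters might be read, here is a minimal sketch that assumes settings.properties is a standard Java properties file. `SettingsSketch` and its methods are hypothetical names, not part of this repository. Note that `java.util.Properties` only treats lines starting with `#` or `!` as comments, so a trailing `// ...` annotation would become part of the value; the sketch strips it defensively.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class SettingsSketch {
    /** Loads key=value pairs from any Reader (a FileReader in real use). */
    public static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    /** Returns the value for key, dropping a trailing " // note" and whitespace. */
    public static String get(Properties props, String key, String fallback) {
        String raw = props.getProperty(key, fallback);
        int note = raw.indexOf(" //");
        return (note >= 0 ? raw.substring(0, note) : raw).trim();
    }

    public static void main(String[] args) throws IOException {
        String text = "Gramsize=2 // Ngram size\nAlgorithm=cbayes\nAlpha=1.0\n";
        Properties props = load(new StringReader(text));
        int gramSize = Integer.parseInt(get(props, "Gramsize", "1"));
        double alpha = Double.parseDouble(get(props, "Alpha", "1.0"));
        // Missing keys fall back to the supplied default.
        String category = get(props, "DefaultCategory", "unknown");
        System.out.println(gramSize + " " + alpha + " " + category);
    }
}
```

Searching for `" //"` with a leading space leaves values such as `hdfs://host/path` intact, since the slashes there are not preceded by a space.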

For Trainer.java

    TrainSet=/home/developer/dataset_rev/freshrevs/train/ // Training set location, containing one subdirectory per category
    ProcessedSet=/home/developer/dataset_rev/freshrevs/processedTrain/ // Processed output directory

For Classifier.java

    ModelPath=/home/developer/dataset_rev/freshrevs/model/ // Path used to store and retrieve the model
    IpDirPath=/home/developer/dataset_rev/freshrevs/test/pos/ // Unclassified dataset
    OpDirPath=/home/developer/dataset_rev/freshrevs/classified/ // Path where classified documents are stored