This is a text classifier written in Java that uses unsupervised learning to classify unstructured text data. The project was completed in the Big Data Sciences course at NYU with Professor Anasse Bari.

The program takes in text files, uses NLP techniques and the Stanford NLP Simple library to preprocess them, finds the top keywords, builds a tf-idf word-document matrix, and clusters each article by its similarity to the others using K-means. F-measure, precision, recall, and a confusion matrix were used as performance metrics. Unknown documents are also assigned to their most likely cluster through an implementation of the KNN algorithm.

*** Bds.java is my main class ***

To run program:
1. Unzip the file.
   **** My program requires Stanford NLP, which provides Sentence and lemmatization ****
   First try running with the dependencies you already have. If that doesn't work...
2. Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html
   Choose the English download, version 3.9.2.
3. The specific referenced libraries used in my program are:
     xom.jar
     protobuf.jar
     stanford-corenlp-3.9.2-models.jar
     stanford-corenlp-3.9.2.jar
   These jars must all be on the class path. A screenshot of my Eclipse environment is included.
4. If using Eclipse, the program can run as is. Otherwise, the jars should be exported as a runnable jar file saved as HW4/BDSHW/myjars.jar
5. From the command line, make sure you are in HW4/BDSHW/ and run:
     javac -cp myjars.jar Bds.java
6. Next run:
     java -cp myjars.jar Bds
   This will show the output from my program. However, k cannot be updated by editing the program in Sublime; it has to be updated through Eclipse.

Settings/Good to know:
1. Default KNN k is 3. To change the KNN k value, manually change it in Bds.java line 15:
     line15 public static final int UNK_K = 8; ///update this for KNN
2. All paths to files can be found in Bds.java; none should need to be updated:
     line10 public static String datafilename = "data.txt";
     line11 public static String stopwordsfilename = "stopwords.txt";
     line12 public static String unknownsfilename = "datahw/";
3. To change the method of similarity comparison, edit Bds.java line 37:
     - true for cosine, false for Euclidean
     - default is cosine
     line37 Similarity.kmeans(myMatrix, K, true);
4. K-means++ initialization is implemented in Similarity as the default. To change to normal K-means, comment out line 205 in Similarity.java and uncomment line 206:
     line205 means = init_kpp_clusters(myMatrix); //k++ means
     line206 //means = init_clusters(myMatrix); //k-means

Notes:
1. Word output is ordered by frequency, with the most frequent at the top.
   To see the word count too, update the commented-out string in Word.java line 64.
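
Example sketch:
The cosine/Euclidean switch and the KNN step for unknown documents described in the settings can be illustrated roughly as follows. This is only a minimal sketch under assumptions: the class SimilaritySketch and its methods (cosine, euclidean, distance, classify) are hypothetical names made up for this example, the vectors are toy values rather than real tf-idf rows, and the actual code in Bds.java and Similarity.java may work differently.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only. All names here are hypothetical and are
// not taken from Bds.java or Similarity.java.
public class SimilaritySketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance: square root of the summed squared differences.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Single "smaller is closer" measure, switched by a boolean flag
    // (true for cosine, false for Euclidean).
    public static double distance(double[] a, double[] b, boolean useCosine) {
        return useCosine ? 1.0 - cosine(a, b) : euclidean(a, b);
    }

    // KNN: return the majority label among the k rows closest to the query.
    // Ties go to the label that reaches the winning vote count first.
    public static int classify(double[][] rows, int[] labels, double[] query,
                               int k, boolean useCosine) {
        int n = rows.length;
        double[] dist = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            dist[i] = distance(rows[i], query, useCosine);
            order[i] = i;
        }
        // Sort row indices from nearest to farthest.
        Arrays.sort(order, Comparator.comparingDouble(i -> dist[i]));

        Map<Integer, Integer> votes = new HashMap<>();
        int best = -1, bestVotes = 0;
        for (int r = 0; r < Math.min(k, n); r++) {
            int label = labels[order[r]];
            int v = votes.merge(label, 1, Integer::sum);
            if (v > bestVotes) {
                bestVotes = v;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for tf-idf rows (assumed data).
        double[][] rows = { {1, 0, 0}, {2, 0.1, 0}, {0, 1, 1}, {0, 2, 1.5} };
        int[] labels = {0, 0, 1, 1};
        double[] unknown = {0.1, 1, 0.8};
        System.out.println(classify(rows, labels, unknown, 3, true));   // cosine
        System.out.println(classify(rows, labels, unknown, 3, false));  // Euclidean
    }
}
```

With these toy vectors both calls assign the unknown document to label 1. Converting cosine similarity to 1 - cosine lets the same nearest-first ranking serve both settings.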
vangul01/K-Means-Text-Classifier
Java implementation of the K-means and KNN algorithms that transforms unstructured data in order to classify text into categories
Java