A simple project that uses NLP and ML to categorize documents in your Google Drive into separate folders based on subject matter using K-Means clustering.
Created at HooHacks 2019 by Hassan Syyid, Andrei Freund, and Calvin Godfrey.
Awarded "Best Use of Google Cloud Platform".
This repository contains the backend of the project, a combination of Node.js and Python, both running on Google Cloud Functions. During the hackathon, we also built a simple, barebones iOS interface to demo what we had created.
The Node.js backend is designed to interface with a frontend, which supplies the Google OAuth credentials needed to access the user's Google Drive. The Python backend is then invoked: it retrieves the text from the user's documents, runs the NLP and ML analysis, and produces the sorted documents.
The Python end uses NLTK and scikit-learn to vectorize the text and run the K-Means clustering algorithm.
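As a rough illustration of that step, here is a minimal sketch of vectorizing text and clustering it with K-Means. It is not the project's actual code: the `cluster_documents` helper and the sample documents are hypothetical, TF-IDF is assumed as the vectorization scheme, and NLTK preprocessing is left out to keep the example short.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for text pulled from Google Drive documents.
documents = [
    "photosynthesis in plant cells converts light into energy",
    "plant cells use chlorophyll for photosynthesis",
    "the french revolution changed european history",
    "european history after the french revolution",
]


def cluster_documents(texts, n_clusters):
    """Return one cluster index per input text."""
    # Turn each text into a TF-IDF weighted bag-of-words vector,
    # dropping common English stop words.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # A fixed random_state keeps the sketch reproducible; n_init reruns
    # K-Means from several starting points and keeps the best run.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(vectors).tolist()


labels = cluster_documents(documents, n_clusters=2)
for label, text in zip(labels, documents):
    print(label, text)
```

The cluster indices themselves are arbitrary; only the grouping matters. Note that the number of clusters has to be chosen up front, which is one reason the output needs manual review.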
Since this project was built in a time-constrained environment, we used an unsupervised algorithm, as we didn't have the resources or time to create a good labeled corpus of data. As a result, the results are not stellar.
However, the results we produced with our limited testing data show a fair amount of accuracy in terms of grouping documents on the same subject together. This suggests that, with further refinement and the introduction of a supervised element, automatically sorting documents by subject matter is a viable goal that can be achieved through ML.
Another drawback is that the groups the documents are sorted into are unlabeled - this means that the user must look through the results to label each cluster. One possible extension is to train a separate ML model on the labels users assign to clusters, so that a subject label can be assigned automatically to each new cluster.
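The labeling idea could look something like the sketch below. This was not built during the hackathon; the training data, the function names, and the choice of a TF-IDF plus logistic regression classifier with a majority vote per cluster are all assumptions made for illustration.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: documents from clusters that users labeled.
labeled_texts = [
    "photosynthesis in plant cells converts light into energy",
    "plant cells use chlorophyll for photosynthesis",
    "the french revolution changed european history",
    "european history after the french revolution",
]
labeled_subjects = ["biology", "biology", "history", "history"]


def train_subject_classifier(texts, subjects):
    """Fit a text classifier on user-labeled documents."""
    classifier = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(),
    )
    classifier.fit(texts, subjects)
    return classifier


def label_cluster(classifier, cluster_texts):
    """Predict a subject per document, then take the majority vote."""
    predictions = classifier.predict(cluster_texts)
    return str(Counter(predictions).most_common(1)[0][0])


classifier = train_subject_classifier(labeled_texts, labeled_subjects)

# Hypothetical new cluster produced by the K-Means step.
new_cluster = [
    "chlorophyll gives plant cells their green color",
    "light drives photosynthesis in plant cells",
]
cluster_label = label_cluster(classifier, new_cluster)
print(cluster_label)
```

A majority vote keeps one misclassified document from deciding the label for the whole cluster. A real version would need far more labeled data than this toy set.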