DocumentClassification

Classifying documents and providing search amongst them using Elasticsearch

Data is collected from three sources: https://github.com/mihaibogdan10/json-reuters-21578, https://www.kaggle.com/tunguz/200000-jeopardy-questions and https://github.com/yaolu/Multi-XScience. Approximately 12,000 documents are indexed into Elasticsearch.
This data is then classified into groups. The classification is based on tf-idf similarity using k-means clustering. The optimal number of clusters is determined through the silhouette method.
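
The grouping step can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn; the function name cluster_documents, the candidate range for k, the seed and the sample texts are all hypothetical and not taken from this repository.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_documents(texts, k_candidates=range(2, 6), seed=0):
    """Cluster texts on tf-idf vectors and keep the k with the best silhouette score.

    Hypothetical helper: the candidate range and seed are illustrative defaults.
    """
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    best = None
    for k in k_candidates:
        if k > len(texts) - 1:  # the silhouette score needs 2 <= k <= n_samples - 1
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        if len(set(labels)) < 2:  # a single cluster has no defined silhouette score
            continue
        score = silhouette_score(vectors, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    if best is None:
        raise ValueError("need at least three documents to choose k")
    return best  # (silhouette score, chosen k, cluster label per document)


# Made-up sample texts, loosely shaped like the three data sources.
docs = [
    "crude oil prices rose as opec cut production",
    "opec ministers agreed to cut crude oil output",
    "this river is the longest river in africa",
    "this mountain is the tallest mountain in africa",
    "we train a neural network for text summarization",
    "a neural network model for abstractive summarization of text",
]
score, k, labels = cluster_documents(docs)
print(k, round(float(score), 3), list(labels))
```

A higher silhouette score means documents sit closer to their own cluster than to the nearest other cluster, which is why the k with the highest score is kept.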
The documents within a group are ranked using the LDA (Latent Dirichlet Allocation) algorithm. The top-ranked document is the leader of the group.
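
How the LDA output is turned into a ranking is not described here, so the following is only one plausible sketch, assuming scikit-learn: fit LDA on the documents of a single group, find the topic that carries the most weight across the group, and order documents by their share of that topic. The name rank_group, the number of topics and the sample group are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def rank_group(texts, n_topics=2, seed=0):
    """Return document indices ordered from most to least representative.

    Assumption: "representative" means the highest weight on the group's
    dominant LDA topic. The repository may score documents differently.
    """
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(counts)  # one topic distribution per document
    dominant = doc_topics.sum(axis=0).argmax()  # topic with the most total weight
    order = doc_topics[:, dominant].argsort()[::-1]  # descending by that topic's weight
    return [int(i) for i in order]


# Made-up documents standing in for one group.
group = [
    "crude oil prices rose as opec cut production",
    "opec ministers agreed to cut crude oil output",
    "oil futures fell after inventories rose",
    "refiners expect crude oil demand to rise",
]
ranking = rank_group(group)
leader = group[ranking[0]]  # the top-ranked document acts as the group leader
print(ranking, "->", leader)
```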

/api/insert/<index> - Inserts the data into the specified index. The inserted data is assigned a document group.
/api/search?q=some text - Performs a search by generated topics
/api/eqsearch?q=some text - Performs a normal search to identify the document source
/api/classify - Improves document classification by re-running the classification task with the newly added data
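
As a small illustration of how a client might address the endpoints above, the sketch below only builds the request URLs using the Python standard library. The base URL http://localhost:5000 and the helper names are assumptions; the host, port, HTTP methods and request body format are not stated here, so no request is actually sent.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:5000"  # assumption: replace with the real host and port


def insert_url(index):
    """URL for inserting data into the given index."""
    return f"{BASE_URL}/api/insert/{quote(index, safe='')}"


def search_url(text, normal=False):
    """URL for a topic search, or for a normal search when normal=True."""
    path = "/api/eqsearch" if normal else "/api/search"
    return f"{BASE_URL}{path}?{urlencode({'q': text})}"


def classify_url():
    """URL that triggers re-running the classification task."""
    return f"{BASE_URL}/api/classify"


print(insert_url("reuters"))
print(search_url("crude oil prices"))
print(search_url("crude oil prices", normal=True))
print(classify_url())
```

Encoding the query with urlencode keeps spaces and special characters in the search text from breaking the URL.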

PrajwalM2212/DocumentClassification