Classifying documents and providing search amongst them using Elasticsearch
- Data is collected from three sources: https://github.com/mihaibogdan10/json-reuters-21578, https://www.kaggle.com/tunguz/200000-jeopardy-questions, and https://github.com/yaolu/Multi-XScience. Approximately 12,000 documents are indexed into Elasticsearch.
- The data is then classified into groups using k-means clustering on tf-idf similarity. The optimal number of clusters is determined with the silhouette method.
- The documents within a group are ranked using Latent Dirichlet Allocation (LDA). The top-ranked document is the leader of the group.
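The clustering step described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's implementation: the function names, the smoothed idf formula, the fixed seed, and the sample documents are all assumptions made for illustration. A real pipeline would more likely rely on a library such as scikit-learn (`TfidfVectorizer`, `KMeans`, `silhouette_score`).

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised tf-idf vectors (dense lists), one per document."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenised for t in set(toks))
    # Smoothed idf (an assumption; other variants are equally valid).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, seed=0, max_iter=50):
    """Plain Lloyd's k-means; initial centroids are k randomly sampled points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(vectors, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    scores = []
    for i, v in enumerate(vectors):
        by_cluster = {}
        for j, w in enumerate(vectors):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(euclidean(v, w))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def best_k(vectors, k_range):
    """Pick the k whose k-means labelling has the highest mean silhouette."""
    return max(k_range, key=lambda k: silhouette(vectors, kmeans(vectors, k)))


if __name__ == "__main__":
    sample_docs = [  # made-up documents, not taken from the datasets above
        "oil prices rise", "crude oil prices fall", "oil exports rise",
        "quantum physics paper", "quantum field theory paper", "physics theory lecture",
    ]
    vecs = tfidf_vectors(sample_docs)
    k = best_k(vecs, range(2, 5))
    print(k, kmeans(vecs, k))
```

Because k-means depends on its starting centroids, a different seed may give a different labelling. Running several seeds per value of k and keeping the best silhouette score is a common way to make the choice more stable.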
/api/insert/<index>
- Inserts the data into the specified index. The inserted data is assigned a document group.
/api/search?q=some text
- Performs a search by generated topics.
/api/eqsearch?q=some text
- Performs a normal search to identify the document source.
/api/classify
- Improves document classification by re-running the classification task with the newly added data.
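As a rough illustration of calling the two search endpoints, the sketch below builds the request URLs with the Python standard library. The base URL, the use of GET, and the JSON response format are assumptions; none of them is stated here, and the helper names `search_url` and `run_search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host and port are placeholders, not taken from the project.
BASE_URL = "http://localhost:5000"


def search_url(query, endpoint="search", base_url=BASE_URL):
    """Build the URL for /api/search or /api/eqsearch with the q parameter."""
    if endpoint not in ("search", "eqsearch"):
        raise ValueError(f"unknown search endpoint: {endpoint}")
    return f"{base_url}/api/{endpoint}?{urlencode({'q': query})}"


def run_search(query, endpoint="search", base_url=BASE_URL):
    """Send the query with GET and decode the body, assuming a JSON response."""
    with urlopen(search_url(query, endpoint, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running at the assumed address, `run_search("some text")` would query the topic search and `run_search("some text", endpoint="eqsearch")` the normal search.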