Document-Classification-using-Doc2Vec

Trains a Doc2Vec model on tagged documents. For each test document, it predicts both the tag and the embedding vector, which are used to classify the document and to find the nearest document in the training set.

Primary Language: Jupyter Notebook

Two methods are being tried.

1st method - Use the Doc2Vec algorithm after extracting text with an OCR API.

  1. The two documents whose Doc2Vec embeddings are most similar are treated as similar documents.
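
A minimal sketch of this idea, assuming the embedding vectors have already been produced (for example by a trained Doc2Vec model). The tags and the three-dimensional vectors below are made-up placeholders, and cosine similarity is one common choice of similarity measure, not necessarily the one used in this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_document(test_vector, train_vectors):
    """Return (tag, similarity) of the most similar training document."""
    best_tag, best_sim = None, -2.0  # cosine similarity is never below -1
    for tag, vec in train_vectors.items():
        sim = cosine_similarity(test_vector, vec)
        if sim > best_sim:
            best_tag, best_sim = tag, sim
    return best_tag, best_sim


# Hypothetical embeddings keyed by training tag (illustrative values only).
train_vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.1, 0.8, 0.2],
    "contract": [0.0, 0.2, 0.9],
}
test_vector = [0.8, 0.2, 0.1]  # hypothetical embedding of a test document

tag, sim = nearest_document(test_vector, train_vectors)
print(tag, round(sim, 3))
```

The returned tag serves as the predicted class, and the similarity score gives a rough sense of how close the match is.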

2nd method - TODO: Use algorithms that detect image similarity

  1. Crop the heading part of each image.
  2. Find a pretrained feature-vector model online, on tfhub.dev or another source.
  3. Run the pretrained model on all the templates (training data) and store the resulting feature vectors.
  4. Take any input from the input folder (test set) and compute its feature vector.
  5. Use a distance metric such as Euclidean or Manhattan to find which template image is nearest to the input.
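
The last step could look roughly like the sketch below. It assumes the feature vectors have already been extracted and stored; the file names and vector values are hypothetical, and real feature vectors from a pretrained image model would be much longer.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_template(input_features, template_features, metric=euclidean):
    """Return the name of the template whose features are closest to the input."""
    return min(
        template_features,
        key=lambda name: metric(input_features, template_features[name]),
    )


# Hypothetical stored feature vectors for the templates (training data).
template_features = {
    "template_a.png": [1.0, 0.0, 2.0],
    "template_b.png": [0.0, 3.0, 1.0],
}
# Hypothetical feature vector of one image from the input folder (test set).
input_features = [0.9, 0.2, 1.8]

match_l2 = nearest_template(input_features, template_features)
match_l1 = nearest_template(input_features, template_features, metric=manhattan)
print(match_l2, match_l1)
```

Passing the metric as a parameter makes it easy to compare how Euclidean and Manhattan distance rank the same templates.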

External Endpoint for the GKE app

External endpoint Here

TODO

  1. Add tags while training and return tags during prediction. # Completed
  2. Predict multiple files at the same time and return a dictionary of outputs.
  3. Check whether gensim has a function for prediction, instead of manually calculating distances.
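
One possible shape for item 2 is sketched below. The names `predict_batch`, `predict_one`, and `fake_predict_one` are hypothetical and do not come from this repository; `fake_predict_one` only stands in for whatever single-file prediction the app already performs.

```python
def predict_batch(file_paths, predict_one):
    """Run predict_one on every path and collect the outputs in a dictionary.

    A failure on one file is recorded under that file's key instead of
    aborting the whole batch.
    """
    results = {}
    for path in file_paths:
        try:
            results[path] = predict_one(path)
        except Exception as exc:
            results[path] = {"error": str(exc)}
    return results


def fake_predict_one(path):
    """Hypothetical stand-in for the real single-file prediction."""
    if not path.endswith(".txt"):
        raise ValueError("unsupported file type")
    return {"tag": "invoice", "nearest_document": "train_001.txt"}


outputs = predict_batch(["a.txt", "b.pdf"], fake_predict_one)
print(outputs)
```

Keying the dictionary by file path lets the caller match each output back to its input, including the files that failed.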