Trains on documents using tags. Predicts both the tags and the embedding vectors of test documents, in order to classify them and to find the nearest document in the training set.
Jupyter Notebook
Two methods are being tried.
1st method - Use the Doc2Vec algorithm after extracting the text with an OCR API.
The two documents whose Doc2Vec embeddings are most similar are treated as similar documents.
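The nearest-document lookup can be sketched as below. This is a minimal illustration and not the notebook's actual code: it assumes the embeddings have already been computed (for example by a trained Doc2Vec model), it assumes cosine similarity as the similarity measure, and the document names and vectors are made-up toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_document(test_vec, train_vecs):
    """Return (doc_id, similarity) for the most similar training document."""
    scored = ((doc_id, cosine_similarity(test_vec, vec))
              for doc_id, vec in train_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings; real ones would come from the trained model.
train_vecs = {
    "invoice_01": [0.9, 0.1, 0.0],
    "receipt_01": [0.1, 0.8, 0.2],
    "letter_01": [0.0, 0.2, 0.9],
}
test_vec = [0.8, 0.2, 0.1]

best_id, best_sim = nearest_document(test_vec, train_vecs)
print(best_id, round(best_sim, 3))
```

If the training documents carry tags, the tag of the nearest document can serve as the predicted class of the test document.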
2nd method - TODO: Use algorithms that detect image similarity
Crop the heading part of each image.
Find a pretrained feature-vector model online, on tfhub.dev or another source.
Run this pretrained model on all the templates (training data) and store the resulting feature vectors.
Take any input from the input folder (test set) and compute its feature vector.
Use a distance metric such as Euclidean or Manhattan to find which template image is nearest to the input.
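The last step of the 2nd method can be sketched as below. This is only an illustration under stated assumptions: the feature vectors are taken as already extracted and stored, the file names and numbers are hypothetical, and cropping and feature extraction are not shown.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_template(input_vec, template_vecs, metric=euclidean):
    """Return the name of the template whose vector is closest to input_vec."""
    return min(template_vecs,
               key=lambda name: metric(input_vec, template_vecs[name]))


# Hypothetical stored feature vectors for the templates (training data).
template_vecs = {
    "template_a.png": [1.0, 0.0, 2.0],
    "template_b.png": [0.0, 3.0, 1.0],
}
# Hypothetical feature vector of one image from the input folder (test set).
input_vec = [0.9, 0.2, 1.8]

print(nearest_template(input_vec, template_vecs, metric=euclidean))
print(nearest_template(input_vec, template_vecs, metric=manhattan))
```

Passing the metric as a parameter makes it easy to compare how Euclidean and Manhattan distance rank the same templates.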