ds-hack

End goal

Take a variety of document types, convert them to text, and surface a front end with search, a classification report and geomapping.

Pipeline was as follows:

1. Convert inputs to text:
1a. Image to text - OCR handwritten and printed text via the Azure Computer Vision OCR API
1b. XHTML to text - Extract text from XHTML using Beautiful Soup
1c. PDF to text - OCR PDF pages with Pytesseract
2. Text to Database - Send text to Cosmos DB using pymongo (through Cosmos DB's MongoDB API)
3. Enhance Database - Entity recognition, NLP preprocessing (e.g. lemmatization, stopword removal) and geocoding.
4. Modelling - Perform TF-IDF and Word2Vec, and produce clusters and document similarity scores
5. Surfacing - Flask front end, hosted on Azure. It supports search and returns modelling results.
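
Sketch of the TF-IDF and document similarity part of the modelling step. This is a minimal from-scratch illustration using only the Python standard library, not the project's actual code: the README does not say which library was used. The stopword list, the smoothed IDF formula, the sample documents and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration; the project's real list is not given.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed variant).
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors):
    """Rank all other documents by similarity to vectors[query_index]."""
    scores = [
        (i, cosine_similarity(vectors[query_index], v))
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up sample documents standing in for text pulled from the database.
docs = [
    "Flood damage reported in York",
    "Flood warning issued for York",
    "Council budget meeting minutes",
]
vectors = tfidf_vectors(docs)
ranked = most_similar(0, vectors)  # list of (document index, similarity score)
```

Here ranked lists the other documents from most to least similar to the first one. The same sparse vectors could be fed into a clustering step; in practice a library implementation would likely replace this hand-rolled version.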