Invoice Mining Assignment

Approach
The problem is to extract the data for the year 2019, along with all the line items, from a borderless table.
This is not a conventional NLP problem: the spatial alignment of the data matters here,
and conventional NLP techniques do not capture it.

So my approach combines computer vision and NLP (gathering spatial information using Tesseract and classifying lines using Naive Bayes).
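
The classification half of this approach can be illustrated with a small multinomial Naive Bayes written in plain Python. This is only a sketch and not the code in createModelNLP.py: the feature set (`line_features`), the class name `LineClassifier`, the labels "row"/"other", and the six training lines are all made up for illustration, and the real model may use a library implementation and different features.

```python
import math
import re
from collections import Counter, defaultdict


def line_features(line):
    """Turn a raw text line into coarse tokens (hypothetical feature set)."""
    feats = []
    for tok in line.split():
        # Collapse every numeric token into one symbol so amounts generalise.
        if re.fullmatch(r"[\d,.()\-]+", tok) and any(c.isdigit() for c in tok):
            feats.append("<NUM>")
        else:
            feats.append(tok.lower())
    # Wide gaps between words are typical of rows in a borderless table.
    feats.extend(["<GAP>"] * len(re.findall(r"\s{3,}", line.strip())))
    return feats


class LineClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, lines, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for line, label in zip(lines, labels):
            self.token_counts[label].update(line_features(line))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, line):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in line_features(line):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training lines standing in for the manually annotated data.
train_lines = [
    "Revenue              1,200      1,500",
    "Cost of sales          800        950",
    "Net income             400        550",
    "Notes to the financial statements",
    "The company was incorporated in Delaware",
    "See accompanying notes",
]
train_labels = ["row", "row", "row", "other", "other", "other"]
clf = LineClassifier().fit(train_lines, train_labels)

print(clf.predict("Total assets   3,000   3,400"))
print(clf.predict("Please refer to the notes"))
```

Collapsing numbers into a single `<NUM>` token and counting wide gaps lets the classifier key on the shape of a table row instead of its exact wording, which seems a reasonable fit for lines whose labels vary from invoice to invoice.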

Steps undertaken are as follows:
1)	Convert the given text to an image using an ASCII art algorithm. (txttoimg.py)
2)	To separate the line items (table rows) from the rest of the content,
        I trained a Naive Bayes model to classify each row as relevant or not.
        For this purpose I collected all the lines from every text file and manually annotated 1000 lines. (dataCollectionNLP.py)
3)	Using the data from step 2, I trained a model for predicting the line items. (createModelNLP.py, predict.py)
4)	Iterate through each image from step 1 row by row (line by line) while obtaining the spatial information of each sparse line.
        Using the model from step 3, identify the table records, then identify the corresponding sparse record (line/word) that occurs
        under the year 2019 by comparing the x-coordinates of the words.
        Format the output as required and write it to CSV. (invoiceReader_Main_Program.py)
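
The x-coordinate comparison in step 4 can be sketched as follows. This is a minimal illustration, not the code in invoiceReader_Main_Program.py: the `Word` type, the helper names `find_header` and `value_under`, and the pixel values are hypothetical, and the sketch assumes each word arrives with a left edge and a width, as in the word-level bounding boxes Tesseract reports.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One recognised word with its horizontal extent (hypothetical type)."""
    text: str
    left: int   # x coordinate of the left edge, in pixels
    width: int  # width of the bounding box, in pixels

    @property
    def right(self):
        return self.left + self.width


def find_header(words, label="2019"):
    """Return the header word whose text equals `label`, or None."""
    for w in words:
        if w.text == label:
            return w
    return None


def value_under(header, row_words):
    """Return the row word that overlaps the header horizontally the most.

    Returns None when no word overlaps, which is how a sparse row with no
    value in that column shows up.
    """
    best, best_overlap = None, 0
    for w in row_words:
        overlap = min(w.right, header.right) - max(w.left, header.left)
        if overlap > best_overlap:
            best, best_overlap = w, overlap
    return best


# Made-up boxes for a header line and two table rows.
header_line = [Word("Particulars", 10, 110), Word("2019", 300, 40), Word("2018", 420, 40)]
full_row = [Word("Revenue", 10, 70), Word("1,200", 290, 50), Word("1,500", 410, 50)]
sparse_row = [Word("Goodwill", 10, 80), Word("75", 440, 20)]

col_2019 = find_header(header_line)
match = value_under(col_2019, full_row)
print(match.text if match else "")
print(value_under(col_2019, sparse_row))
```

Using horizontal overlap instead of an exact x match tolerates values that are right-aligned or slightly wider than the header, and returning None keeps a sparse row from borrowing a value that belongs to a neighbouring column.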