A solution to extract tabular data from PDF and Image Files
Install Requirements
pip install -r requirements.txt
Run flask app (server)
sudo python app.py
Web app - Open webapp/home.html in browser
The pyPDF module is used to read the number of pages in the PDF, which drives the per-page iteration that follows.
Follow the commands below to cd into the PDF module directory and run the table extraction script.
cd "TableExtraction/PDF Module/"
python table_extract.py
The program then displays the prompt shown below, asking for the name of the PDF file.
Enter the pdf you want to extract the table --> 006
The above input loads the 006.pdf file from the dataset and extracts the table content from it.
Next, every page of the PDF is written to its own file, named Page_0.pdf, Page_1.pdf, and so on up to the last page of the PDF.
Page segmentation has the following advantages:
- Speeds up the algorithm by reducing the size of each input file.
- Reduces spurious input to the algorithm.
- Makes it possible to identify the exact page on which a table is located.
Sample Page Segmentation view
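The page counting and segmentation steps described above can be sketched as follows. This is a minimal illustration, not the repository's own code: it assumes the modern `pypdf` package (the README only says "pyPDF"), and the helper names `page_filename` and `split_pdf` are hypothetical.

```python
from pathlib import Path


def page_filename(index: int) -> str:
    """Return the per-page file name used in this README, e.g. Page_0.pdf."""
    return f"Page_{index}.pdf"


def split_pdf(source: str, out_dir: str = ".") -> list[str]:
    """Write every page of `source` to its own single-page PDF.

    Assumption: the `pypdf` package is installed. It is imported here so
    that the naming helper above stays usable without it.
    """
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    written = []
    # len(reader.pages) is the page count; enumerate walks the same pages.
    for index, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        target = Path(out_dir) / page_filename(index)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(str(target))
    return written
```

Calling `split_pdf("006.pdf")` would write Page_0.pdf, Page_1.pdf, and so on into the current directory and return their paths.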
Tabula-py is a Python library built on top of the Java library tabula-java. It takes its arguments through Python calls and invokes the .jar file to find the tables in a PDF.
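from tabula import convert_into  # tabula-py

for i in range(0, pag_no):  # pag_no: page count obtained earlier with pyPDF
    # Write each page's tables to result_<i>.csv and result_<i>.xml
    convert_into('Page_'+str(i)+'.pdf', 'result_'+str(i)+'.csv', output_format = 'CSV')
    convert_into('Page_'+str(i)+'.pdf', 'result_'+str(i)+'.xml', output_format = 'xml')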
The above code iterates over all the Page_<i>.pdf files to extract the table data.
The extracted tables are stored in .CSV format, which gives the user direct access to the tables in the PDF without manual entry.
result_0.csv
result_1.csv
......
The files above are the output results that contain the table information. The tables found in the PDF are reported in the following format:
Table found in -----> PAGE3 and stored in -----> result_0.csv
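One way to produce a report line like the one above is to check each result file for content. The sketch below is an assumption about how that could be done, not the logic of table_extract.py: `has_table` and `report_tables` are hypothetical helpers, it assumes one result_<i>.csv per page as in the loop shown earlier, and it numbers pages from 1, which may differ from how the actual script numbers pages and result files.

```python
import csv
from pathlib import Path


def has_table(csv_path) -> bool:
    """Return True if the CSV file exists and holds at least one non-empty cell."""
    path = Path(csv_path)
    if not path.is_file():
        return False
    with path.open(newline="") as handle:
        return any(cell.strip() for row in csv.reader(handle) for cell in row)


def report_tables(page_count: int, directory: str = ".") -> list[str]:
    """Build one report line per page whose result CSV contains data."""
    lines = []
    for index in range(page_count):
        name = f"result_{index}.csv"
        if has_table(Path(directory) / name):
            # Assumption: pages are reported 1-based, result files 0-based.
            lines.append(
                f"Table found in -----> PAGE{index + 1} and stored in -----> {name}"
            )
    return lines
```

Printing each entry of `report_tables(pag_no)` after the conversion loop would list only the pages for which tabula wrote table data.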
PDF file with a table on its 3rd page.
Image of result extracted with the Table Information into the CSV file.
- No preprocessing of PDFs is required.
- Faster processing due to the page segmentation technique.
- Higher accuracy, even on noisy pages.
- Better ROI (Region of Interest) extraction and a higher text rejection rate.
The generated output CSV files are written to PDF Module/pdfname
Tesseract runs OCR on the input image and produces a sandwich PDF that combines the original image with the extracted OCR text.
Follow the commands below to cd into the data directory and convert the image to a searchable PDF.
cd "TableExtraction/Image Module/data"
tesseract 29.jpg 29 -l eng pdf
pdftohtml -c -hidden -xml 29.pdf 29.xml
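The two commands above can also be driven from Python when several images need converting. This is a minimal sketch that assumes `tesseract` and `pdftohtml` are on the PATH; `build_commands` and `image_to_xml` are hypothetical helper names, and the arguments are exactly the ones shown above.

```python
import subprocess
from pathlib import Path


def build_commands(image_path: str, lang: str = "eng") -> list[list[str]]:
    """Return the tesseract and pdftohtml argument lists for one image."""
    stem = str(Path(image_path).with_suffix(""))
    return [
        # OCR the image into a searchable (sandwich) PDF named <stem>.pdf
        ["tesseract", image_path, stem, "-l", lang, "pdf"],
        # Convert that PDF into XML with positional text elements
        ["pdftohtml", "-c", "-hidden", "-xml", f"{stem}.pdf", f"{stem}.xml"],
    ]


def image_to_xml(image_path: str) -> str:
    """Run both commands in order and return the path of the XML output."""
    for command in build_commands(image_path):
        subprocess.run(command, check=True)  # raises if a tool fails
    return str(Path(image_path).with_suffix(".xml"))
```

For example, `image_to_xml("29.jpg")` runs the same two steps as the shell commands and returns `29.xml`.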
Find sample XML in Image/data folder
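The XML written by pdftohtml stores each text fragment as a `<text>` element with `top` and `left` position attributes inside a `<page>` element. The sketch below shows one way such positions could be grouped into table rows. It is an illustration only and is not the logic of extract.py: the function name `rows_from_xml` and the `tolerance` parameter are assumptions.

```python
import xml.etree.ElementTree as ET


def rows_from_xml(xml_text: str, tolerance: float = 3) -> list[list[str]]:
    """Group <text> elements into rows by their `top` coordinate, page by page."""
    root = ET.fromstring(xml_text)
    table_rows = []
    for page in root.iter("page"):
        # Sort fragments top-to-bottom, then left-to-right.
        items = sorted(
            (float(node.get("top")), float(node.get("left")),
             "".join(node.itertext()).strip())
            for node in page.iter("text")
        )
        rows = []  # each entry: (top of first fragment, [(left, content), ...])
        for top, left, content in items:
            if not content:
                continue
            # Fragments whose `top` is within `tolerance` share a row.
            if rows and abs(top - rows[-1][0]) <= tolerance:
                rows[-1][1].append((left, content))
            else:
                rows.append((top, [(left, content)]))
        table_rows.extend(
            [content for _, content in sorted(cells)] for _, cells in rows
        )
    return table_rows
```

Each returned row is a list of cell strings ordered left to right, which can be passed to `csv.writer` to produce a CSV file.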
python extract.py
Generated output images and CSV files in Image/generated_output folder