This project was built at Citi, Pune, as a summer analyst intern project.
Note: The other repository is at https://github.com/EshitaShukla/PDFToXML, where only some parts of the program are available. The project structure differs between the two repositories.
Write a PDF to XML utility (tool) using the Apache PDFBox library, so that the tool can be used to compare PDF files against DB tables.
Citi has a framework that compares XML data with database tables and reports the data differences.
However, the framework does not support comparison between PDF data and table data, so that comparison is done manually, which makes it error-prone.
Once the PDF file is converted to an XML file, it can be fed into the existing framework to compare the data and find the root cause of data discrepancies in the PDF files.
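As a rough illustration of the XML side of this conversion, the sketch below serialises one already-extracted table into XML using only the JDK's built-in XML classes. The class name `TableXmlSketch`, the method `tableToXml`, and the element names `document`, `table`, `row` and `cell` are all hypothetical; the schema that the Citi framework actually expects is not described here, so treat the layout as a placeholder.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: element names are assumptions, not the framework's schema.
class TableXmlSketch {

    /** Builds an XML string with one <row> per table row and one <cell> per column. */
    static String tableToXml(List<String> headers, List<List<String>> rows) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("document");
            doc.appendChild(root);
            Element table = doc.createElement("table");
            root.appendChild(table);

            for (List<String> row : rows) {
                Element rowEl = doc.createElement("row");
                table.appendChild(rowEl);
                // Pair each value with its column header; extra values are ignored.
                for (int i = 0; i < headers.size() && i < row.size(); i++) {
                    Element cell = doc.createElement("cell");
                    cell.setAttribute("name", headers.get(i));
                    cell.setTextContent(row.get(i)); // special characters are escaped on output
                    rowEl.appendChild(cell);
                }
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("id", "amount");
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("1", "250.00"),
                Arrays.asList("2", "99.50"));
        System.out.println(tableToXml(headers, rows));
    }
}
```

Keeping the column name as an attribute on each cell is one way to make a later column-by-column comparison against a database table straightforward, but the real framework may require a different shape.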
The utility performs three kinds of extraction on a PDF:
- Table Extraction
- Text Extraction
- Image Extraction
Our approach is unique in that tables and text are handled separately. Images are extracted afterwards.
More types of tables still need to be accommodated in the extraction algorithm.
Text extraction needs to be generalised for a wider range of line spacings.
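One possible way to make text grouping less dependent on a fixed line spacing is to derive the paragraph-break threshold from the page itself. The sketch below is not the project's current algorithm. It assumes the vertical baseline position of each text line has already been obtained (for example from PDFBox text positions) and is sorted top to bottom. The names `LineSpacingSketch`, `groupLines` and `gapFactor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: groups text lines into paragraphs using a relative gap threshold.
class LineSpacingSketch {

    /**
     * Returns paragraphs as lists of line indices. A new paragraph starts when the
     * gap to the previous baseline exceeds gapFactor times the median gap.
     * Baselines must be sorted in reading order (increasing values).
     */
    static List<List<Integer>> groupLines(double[] baselines, double gapFactor) {
        List<List<Integer>> paragraphs = new ArrayList<>();
        if (baselines.length == 0) {
            return paragraphs;
        }
        double[] gaps = new double[baselines.length - 1];
        for (int i = 1; i < baselines.length; i++) {
            gaps[i - 1] = baselines[i] - baselines[i - 1];
        }
        // The median gap approximates the normal line spacing of this page.
        double threshold = median(gaps) * gapFactor;

        List<Integer> current = new ArrayList<>();
        current.add(0);
        for (int i = 1; i < baselines.length; i++) {
            if (gaps[i - 1] > threshold) {
                paragraphs.add(current);
                current = new ArrayList<>();
            }
            current.add(i);
        }
        paragraphs.add(current);
        return paragraphs;
    }

    /** Median of the values, or 0.0 for an empty array. */
    static double median(double[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Tight spacing (12 units) and loose spacing (20 units) with the same structure.
        System.out.println(groupLines(new double[] {100, 112, 124, 160, 172}, 1.5));
        System.out.println(groupLines(new double[] {100, 120, 140, 200, 220}, 1.5));
    }
}
```

Both sample inputs split into the same two paragraphs even though their line spacings differ, because the threshold scales with the median gap instead of being a fixed number of units.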
- JDK Platform version 8
- Apache PDFBox version 2.0.19
The program can be run in several ways.
- Open the terminal
- Change your current working directory (using cd) to the folder that contains mainClass.java
cd /<your>/<path>/.../PDF2XML/src/main/java/com/example/pdf2xml/
- Compile the file
javac mainClass.java
- Run the file
java mainClass
If you have obtained the .jar file from the HackerEarth portal, follow the instructions below.
- Open the terminal
- Change your current working directory (using cd)
cd your/path/here/
- Run the file
java -jar PDFtoXML-me.jar