PDF2XML

Project built at Citi, Pune, as a summer analyst internship project.

Note: The other repository is at https://github.com/EshitaShukla/PDFToXML, where only some parts of the program are available. The project structure differs between the two repositories.

Problem Statement

Write a PDF-to-XML utility (tool) using the Apache PDFBox library, so that the tool can be used to compare PDF files with DB tables.

Description

Citi has a framework that compares XML data with database tables and reports the differences.

However, the framework does not support comparing PDF data with table data, so that comparison is done manually, which makes it error-prone.

Once a PDF file is converted to an XML file, it can be fed into the existing framework to compare the data and find the root cause of data discrepancies in the PDF files.
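As an illustration of the conversion target, here is a minimal sketch of serializing already-extracted text lines to XML with the JDK's built-in DOM and transformer APIs. The element names (`document`, `text`, `line`) and the class `XmlSketch` are hypothetical; the schema the actual framework expects is not described here.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: not part of the project, only a sketch of the idea.
final class XmlSketch {
    /** Wraps each extracted line in a <line> element under <document><text>. */
    static String toXml(List<String> lines) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("document"); // assumed element name
            doc.appendChild(root);
            Element text = doc.createElement("text");     // assumed element name
            root.appendChild(text);
            for (String line : lines) {
                Element e = doc.createElement("line");    // assumed element name
                e.setTextContent(line); // special characters are escaped on output
                text.appendChild(e);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not build XML", ex);
        }
    }
}
```

Calling `XmlSketch.toXml(Arrays.asList("Total: 42", "A & B"))` returns a single XML string in which the ampersand is escaped, so the result stays well-formed whatever text the PDF contained.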

Features

The tool supports three kinds of extraction from a PDF:

Table Extraction
Text Extraction
Image Extraction

What distinguishes our approach is that tables and text are handled separately. Images are extracted afterwards.

Approach

Table Extraction

More types of tables still need to be accommodated in the extraction algorithm.
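To make the problem concrete, here is a minimal sketch of one simple table heuristic. It is not the project's algorithm: it assumes the page text is already available as plain lines in which columns are separated by two or more spaces, and it treats consecutive lines with the same column count as table rows. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a whitespace-based heuristic, not the project's algorithm.
final class TableSketch {
    /** Splits a line into cells wherever two or more whitespace characters occur. */
    static List<String> splitCells(String line) {
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    /**
     * Returns the first run of at least two consecutive lines that share the
     * same column count (two or more columns), or an empty list if none exists.
     */
    static List<List<String>> firstTable(List<String> lines) {
        List<List<String>> run = new ArrayList<>();
        for (String line : lines) {
            List<String> cells = splitCells(line);
            boolean sameWidth = !run.isEmpty() && cells.size() == run.get(0).size();
            if (cells.size() >= 2 && (run.isEmpty() || sameWidth)) {
                run.add(cells);                 // start or extend the current run
            } else {
                if (run.size() >= 2) {
                    return run;                 // the run ended and is long enough
                }
                run = new ArrayList<>();        // discard a run that was too short
                if (cells.size() >= 2) {
                    run.add(cells);             // this line may start a new run
                }
            }
        }
        return run.size() >= 2 ? run : new ArrayList<List<String>>();
    }
}
```

A heuristic like this only covers plain grid-like tables. Tables with merged cells, empty cells, or single-space column gaps are examples of the table types it would miss.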

Text Extraction

Text extraction needs to be generalised to handle a wider range of line spacings.
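One possible way to become less sensitive to line spacing is to compare each gap between lines with the typical gap on the page instead of a fixed value. The sketch below is an assumption, not the project's implementation: it takes the baseline y-coordinates of consecutive lines (for example, as reported by the PDF library for each text position), computes the median gap, and starts a new paragraph when a gap exceeds `factor` times that median. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: adaptive paragraph splitting based on the median line gap.
final class LineSpacingSketch {
    /** Median distance between consecutive baselines; 0 if fewer than two lines. */
    static double medianGap(double[] baselines) {
        if (baselines.length < 2) {
            return 0.0;
        }
        double[] gaps = new double[baselines.length - 1];
        for (int i = 0; i < gaps.length; i++) {
            gaps[i] = Math.abs(baselines[i + 1] - baselines[i]);
        }
        Arrays.sort(gaps);
        int mid = gaps.length / 2;
        return gaps.length % 2 == 1 ? gaps[mid] : (gaps[mid - 1] + gaps[mid]) / 2.0;
    }

    /**
     * Groups line indices into paragraphs. A gap larger than
     * factor * medianGap starts a new paragraph.
     */
    static List<List<Integer>> paragraphs(double[] baselines, double factor) {
        List<List<Integer>> result = new ArrayList<>();
        if (baselines.length == 0) {
            return result;
        }
        double threshold = factor * medianGap(baselines);
        List<Integer> current = new ArrayList<>();
        current.add(0);
        for (int i = 1; i < baselines.length; i++) {
            double gap = Math.abs(baselines[i] - baselines[i - 1]);
            if (gap > threshold) {      // unusually large gap: paragraph break
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(i);
        }
        result.add(current);
        return result;
    }
}
```

With a factor of 1.5, the baselines `{700, 688, 676, 640, 628}` (single-spaced) and `{700, 676, 652, 580, 556}` (double-spaced) both split into the same two paragraphs, because the threshold scales with the spacing actually used on the page.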

Software & platforms used

JDK Platform version 8
Apache PDFBox version 2.0.19

Execution

The program can be run in either of the following ways.

Executing using `mainClass.java`

Open the terminal
Change your current working directory to the folder that contains `mainClass.java` (using cd): `cd /<your>/<path>/.../PDF2XML/src/main/java/com/example/pdf2xml/`
Compile the file: `javac mainClass.java`
Run the file: `java mainClass`

Executing using `.jar` file

If you have obtained the `.jar` file from the HackerEarth portal, follow the instructions below.

Open the terminal
Change your current working directory (using cd): `cd your/path/here/`
Run the file: `java -jar PDFtoXML-me.jar`
