- The data is prepared from the electoral roll of the constituency in the region. We visited the Government's one-stop Electoral Roll website and downloaded the PDF after entering a captcha, which was a manual process.
- The downloaded PDF was then converted to an OCR-enabled PDF.
- The data extracted through OCR was further processed and cleaned.
- The data was then converted to JSON objects.
- The data was then moved into a MongoDB database.
- The API was then built using Node.js and Express.js and hosted on an AWS instance.
- The API endpoints were tested using Postman.
- We first fetch places from the places collection in the database and get the place id.
- Using the place id, we fetch the electoral roll data of that particular place.
- We search for the person by voter id and find their relations with other people based on the house number in that particular place.
- We generate a PDF of the person and their relations with family members, as mentioned in the problem statement.
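
The lookup described in the steps above can be sketched roughly as follows. This is only an illustration in Python, not the project's Node.js implementation, and the field names `voter_id`, `name`, and `house_no` are assumptions about how the records are stored.

```python
from collections import defaultdict


def build_households(records):
    """Group the electoral roll records of one place by house number."""
    households = defaultdict(list)
    for record in records:
        households[record["house_no"]].append(record)
    return households


def find_family(records, voter_id):
    """Return (person, relatives), where relatives share the person's house number."""
    person = next((r for r in records if r["voter_id"] == voter_id), None)
    if person is None:
        return None, []
    households = build_households(records)
    relatives = [
        r for r in households[person["house_no"]] if r["voter_id"] != voter_id
    ]
    return person, relatives


# Hypothetical records for a single place (field names are assumptions).
records = [
    {"voter_id": "ABC1234567", "name": "Asha", "house_no": "12"},
    {"voter_id": "ABC7654321", "name": "Ravi", "house_no": "12"},
    {"voter_id": "XYZ0000001", "name": "Meena", "house_no": "14"},
]

person, relatives = find_family(records, "ABC1234567")
print(person["name"], [r["name"] for r in relatives])
```

Sharing a house number is treated here as the only signal for a family relation, which matches the description above but may group unrelated residents of the same house.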
- Create a .env file and add these two configuration values:

      PORT = <YOUR PORT NUMBER>
      MONGO_URL = <YOUR MONGODB URI>

- Click on this link to get the sample_data: https://github.com/jhonsnow456/FamilyTreeAPI/tree/main/sample_data
- Upload the sample_data of electorals in JSON format to your MongoDB electoral collection.
- Upload the sample_data of places in JSON format to your MongoDB places collection.
- Use Node.js version 16 LTS or higher.
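
The two upload steps above (electorals and places) can also be scripted. The sketch below assumes each sample file holds a JSON array of documents and uses `insert_many`, which is the bulk insert method on a pymongo collection. The database name and file names in the commented usage are hypothetical.

```python
import json


def load_records(path):
    """Read a JSON file and return its contents as a list of documents."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: a file holds either one document or an array of documents.
    return data if isinstance(data, list) else [data]


def upload_records(collection, records):
    """Insert records into a MongoDB collection and return how many were inserted."""
    if not records:
        return 0  # insert_many rejects an empty list, so skip the call
    result = collection.insert_many(records)
    return len(result.inserted_ids)


# Usage with pymongo (requires `pip3 install pymongo`; names are hypothetical):
#
#   import os
#   from pymongo import MongoClient
#
#   client = MongoClient(os.environ["MONGO_URL"])
#   db = client["familytree"]
#   upload_records(db["electorals"], load_records("sample_data/electorals.json"))
#   upload_records(db["places"], load_records("sample_data/places.json"))
```

Keeping the pymongo import out of the helper functions means they can be exercised without a running database.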
- Use a package manager such as yarn or npm, as per your choice.

  For npm:

      npm i

  For yarn:

      yarn install

- Change directory to pdf_processing:

      cd pdf_processing

- Use Python version 3.7 or 3.8.
- Run the commands mentioned below:

      $ python3.8 -m venv env
      $ source env/bin/activate
      $ pip3 install -r requirements.txt

- Follow the steps mentioned in the next section, Data Collection from pdf, for further processing.
- Install the CLI tool ocrmypdf to process the PDF. Since we are using a Linux system, run:

      sudo apt install ocrmypdf

- Install pikepdf from the command line:

      pip3 install pikepdf

  and use the code below to decrypt the file. Note: This happens because of modern-day scanners.

      import pikepdf

      # Open the protected PDF (use your own file name here).
      pdf = pikepdf.open('data2.pdf')
      # Save a decrypted copy.
      pdf.save('data_2.pdf')

- Run the following command in the terminal to get the OCR output PDF file:

      ocrmypdf -l eng --deskew --title 'data_.pdf' --jobs 2 --output-type pdfa data_2.pdf output.pdf

- Now extract the voter details, clean them, convert the data into JSON format, and put the details into the database mentioned above.
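
As a rough illustration of the extraction step above, the sketch below parses OCR text into JSON records with regular expressions. The text layout (one entry per blank-line-separated block, with `Name`, `House Number`, and `Age` labels), the voter id pattern of three letters followed by seven digits, and the output field names are all assumptions. Real OCR output will need different patterns and more cleaning.

```python
import json
import re

# Assumed patterns; adjust them to the layout of your own OCR output.
FIELD_PATTERNS = {
    "voter_id": re.compile(r"\b([A-Z]{3}\d{7})\b"),
    "name": re.compile(r"^Name\s*[:;]\s*(.+)$", re.MULTILINE),
    "house_no": re.compile(r"House Number\s*[:;]\s*(\S+)"),
    "age": re.compile(r"Age\s*[:;]\s*(\d+)"),
}


def parse_entry(block):
    """Extract one voter record from a block of OCR text; missing fields become None."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(block)
        record[field] = match.group(1).strip() if match else None
    if record["age"] is not None:
        record["age"] = int(record["age"])
    return record


def parse_entries(text):
    """Split the text on blank lines and parse each non-empty block."""
    blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    return [parse_entry(b) for b in blocks]


# Hypothetical OCR output for two voters.
sample_text = """ABC1234567
Name : Asha Devi
House Number : 12
Age : 41

XYZ7654321
Name: Ravi Kumar
House Number: 12
Age: 45
"""

records = parse_entries(sample_text)
print(json.dumps(records, indent=2))
```

Fields that cannot be found are kept as `None` rather than dropped, so incomplete entries can be reviewed during cleaning before they reach the database.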
- The electoral roll data we are currently using is in English; however, the same procedure can be used to extract data in other languages.
- Text in other languages is then translated to English using the Python library translate.
- This lets the approach expand across the length and breadth of our country, increasing the size of the organised data of voters and their relations.
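
The translation step could be wired in along these lines. This is a sketch: the helper takes any text-to-text callable, so the `translate` library's `Translator.translate` method can be passed in. The field names and the source language code in the commented usage are assumptions.

```python
def translate_fields(record, translate_text, fields=("name",)):
    """Return a copy of record with the given text fields passed through translate_text."""
    translated = dict(record)
    for field in fields:
        value = record.get(field)
        # Only translate non-empty strings; leave numbers and missing values alone.
        if isinstance(value, str) and value.strip():
            translated[field] = translate_text(value)
    return translated


# Usage with the translate library (needs network access; "hi" is an assumed source language):
#
#   from translate import Translator
#
#   translator = Translator(from_lang="hi", to_lang="en")
#   english_record = translate_fields(record, translator.translate, fields=("name",))
```

Passing the translator in as a callable keeps the record handling separate from the translation backend, so the same helper works if a different translation service is used later.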