Implemented an automated scraping tool for the website 'https://pay2igr.igrmaharashtra.gov.in/eDisplay/propertydetails'
WalkThrough Video : https://youtu.be/sFuWeqmPp_A
NOTE: This script may not follow industry standards yet, but I am willing to learn and will quickly grasp everything required for the further processes.
Navigate to the root directory and install the required libraries:
pip install -r requirements.txt
Run the scraper (execute the 'run.py' file):
python run.py
Steps to follow:
- As the site opens, let the script enter details until doc year = 2023
- Enter the captcha manually (auto OCR will be implemented in future iterations)
- Wait for the script to search and enter rows
- Use the Google Translate extension to translate the whole page
  NOTE: A window of 25 seconds is added for scrolling down to the bottom so that the extension translates the data fully.
- If there is more than one page, let the script click the NEXT button, and scroll the page again for the translated data. (Auto-scrolling to be added in further iterations)
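The pagination flow above can be sketched roughly as below. This is only an illustration: `find_next_button` and `handle_page` are hypothetical helper names standing in for the real logic in run.py, and the 25-second wait matches the note above.

```python
import time

TRANSLATE_WINDOW_SECONDS = 25  # window for manual scrolling so the extension translates fully

def paginate(driver, find_next_button, handle_page):
    """Loop over result pages: wait for translation, process rows, click NEXT.

    `find_next_button(driver)` should return the NEXT button element or None
    on the last page; `handle_page(driver)` scrapes and cleans the current
    50-row batch. Both are placeholder names, not the actual run.py helpers.
    """
    while True:
        time.sleep(TRANSLATE_WINDOW_SECONDS)  # manual scroll + translate window
        handle_page(driver)
        next_button = find_next_button(driver)
        if next_button is None:  # no NEXT button means we are on the last page
            return
        next_button.click()
```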
This process also runs a data cleaning pipeline with a batch size of 50 rows (1 page).
Subsequently, it adds the data to the PostgreSQL table active on the local server.
- Endpoint: /column_name
- Method: GET
- Description: Retrieve all scraped data from the database.
- Returns: all records in JSON
Steps:
- Go to the Server folder
- Run the command
python server.py
- Wait for the development server to start.
- Copy the link and paste it into an API testing tool (e.g. Postman)
- Follow the example search strings below
These are just examples of the API endpoints that can be created; SQL queries can be used to search the database to a good extent.
These basic examples were implemented to demonstrate how the API endpoints work with the PostgreSQL table.
Example Requests:
- Request data for a specific year
localhost/2023
- Request address using partial text search
localhost/address?search=mumbai
- Request name using partial text search
localhost/name?search=advocate
- Request document number using partial text search
localhost/doc_no?search=2449
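A partial text search like the ones above can be backed by a parameterised LIKE query. The sketch below runs against SQLite (standard library) for demonstration; the actual server targets PostgreSQL, where ILIKE would give case-insensitive matching. The table name `property_details` and the column whitelist are assumptions, not the real schema.

```python
import sqlite3

def search_column(conn, column, term):
    """Return rows where `column` contains `term` (partial text search).

    The column name is interpolated, so it is checked against a whitelist;
    the search term itself is passed as a bound parameter to avoid SQL
    injection. Column names here are assumed, not the project's real schema.
    """
    allowed = {"address", "name", "doc_no"}  # hypothetical column whitelist
    if column not in allowed:
        raise ValueError(f"unknown column: {column}")
    query = f"SELECT * FROM property_details WHERE {column} LIKE ?"
    return conn.execute(query, (f"%{term}%",)).fetchall()

# demo with an in-memory SQLite table standing in for the Postgres one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE property_details (doc_no TEXT, name TEXT, address TEXT)")
conn.execute("INSERT INTO property_details VALUES ('2449', 'advocate x', 'mumbai west')")
conn.execute("INSERT INTO property_details VALUES ('1001', 'y', 'pune')")

rows = search_column(conn, "address", "mumbai")
```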
Libraries used: pandas, sklearn.base, numpy
Loc: custom_transformers.py
Steps
- Dropped rows containing NaN values
- Dropped the undesired column at list index 2 (the link was already scraped into another column)
- Renamed columns to database conventions separated with an underscore (ex. buyer_name)
- Changed date dtypes from string object to datetime object
- Changed float dtypes to int dtypes
- Separated the latest buyers and sellers from the buyer and seller list columns, creating new columns latest_buyer_name and latest_seller_name
- Renamed old buyer and seller columns to buyer_history and seller_history
- Removed anomalies from doc_type column
All these steps are performed with sklearn.pipeline and sklearn.base, using the Pipeline, TransformerMixin, and BaseEstimator classes.
BaseEstimator has no real use in this case, but it is helpful for imputing estimated values for numerical data.
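Two of the steps above, written as a minimal sketch with TransformerMixin and BaseEstimator. This is not the actual custom_transformers.py code: the column names and the assumption that histories are comma-separated with the latest party first are illustrative only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DropNaN(BaseEstimator, TransformerMixin):
    """Drop any row containing NaN values."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.dropna()

class LatestParty(BaseEstimator, TransformerMixin):
    """Split the latest buyer/seller out of the history columns.

    Assumes comma-separated histories with the most recent party first;
    the real parsing logic in custom_transformers.py may differ.
    """
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X["latest_buyer_name"] = X["buyer_history"].str.split(",").str[0].str.strip()
        X["latest_seller_name"] = X["seller_history"].str.split(",").str[0].str.strip()
        return X

cleaning = Pipeline([("dropna", DropNaN()), ("latest", LatestParty())])
```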
Libraries: pandas, sqlalchemy, psycopg2
Loc: Table/table.py
Steps
- Connect to the database using the DB_PARAMS constants
- Create a SQLAlchemy engine and connect to the database
- Receive the DataFrame in the save_table function by value
- Used the to_sql method with the SQLAlchemy engine to insert values into the database
- If the table exists, the selected option is to append values, so the inputs can keep coming from the webdriver
- Defined the POSTGRES_DTYPES constant in Constants/constants.py and used it to define the datatypes of the database columns
- Done for now, all inputs added
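The save step above might look roughly like this. In the real Table/table.py the engine is built from DB_PARAMS against Postgres; here the engine is passed in so the sketch works with any SQLAlchemy engine, and the table name and dtype mapping are assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine

def save_table(df, engine, table_name="property_details", dtypes=None):
    """Append the cleaned DataFrame batch to the database table.

    if_exists="append" lets batches keep arriving from the webdriver;
    `dtypes` would carry the POSTGRES_DTYPES mapping in the real project.
    """
    df.to_sql(table_name, engine, if_exists="append", index=False, dtype=dtypes)

# demo against an in-memory SQLite engine (the project itself targets Postgres)
engine = create_engine("sqlite://")
batch = pd.DataFrame({"doc_no": [2449], "latest_buyer_name": ["x"]})
save_table(batch, engine)
save_table(batch, engine)  # second call appends rather than replaces
```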
This package demonstrates my ability to scrape a website and clean the data in transit so that it is ready to be stored in the database.
Most of my learnings from this project include:
- BS4 makes scraping structured data easier.
- Automating every step of data ingestion can be challenging.
- I was weak with SQL queries, but through this project I have learnt the capabilities of SQL.
- There is always scope for improvement.
I was advised to use a Docker image to load data into Postgres. I am unfamiliar with Docker's use cases, so I will be exploring Docker next to ensure that the ETL flow meets industry standards.
The captcha step can be automated using OCR to convert the image to text.
For now, the extension can only be enabled by the webdriver; the user still needs to manually translate the page using the extension and scroll.
Using JS injection in web pages may help with scrolling, and with libraries like PyAutoGUI in Python, we can direct the mouse cursor so the page translates itself.
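The scrolling half of that idea could look like the sketch below, using Selenium's execute_script to inject JS. The helper name is hypothetical; `driver` stands for any Selenium WebDriver.

```python
SCROLL_TO_BOTTOM_JS = "window.scrollTo(0, document.body.scrollHeight);"

def scroll_to_bottom(driver):
    """Scroll the page to the bottom so the translate extension sees all rows.

    Works with any object exposing Selenium's execute_script; calling this
    repeatedly with a short sleep in between would handle lazily loaded rows.
    """
    driver.execute_script(SCROLL_TO_BOTTOM_JS)
```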
I will keep exploring further possibilities for automating the scraping.