Implemented an automated scraping tool for the website 'https://pay2igr.igrmaharashtra.gov.in/eDisplay/propertydetails'
WalkThrough Video : https://youtu.be/sFuWeqmPp_A
NOTE: This script may not follow industry standards yet, but I am willing to learn and will quickly grasp everything required for the further processes.
Navigate to the root directory and install the required libraries:
pip install -r requirements.txt
Run the scraper (execute the 'run.py' file):
python run.py
Steps to follow:
- As the site opens, let the script enter details until doc year = 2023
- Enter the captcha manually (auto OCR will be implemented in future iterations)
- Wait for the script to search and enter rows
- Use the Google Translate extension to translate the whole page
  NOTE: A window of 25 seconds is added for scrolling down to the bottom so that the extension translates the data fully.
- If there is more than one page, let the script click the NEXT button, and scroll the page again for the translated data. (Auto-scrolling to be added in further iterations)
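The pagination flow above can be sketched roughly as below. This is only an illustration: `find_next_button` and `handle_page` are hypothetical helper names standing in for the real logic in run.py, and the 25-second wait matches the note above.

```python
import time

TRANSLATE_WINDOW_SECONDS = 25  # window for manual scrolling so the extension translates fully

def paginate(driver, find_next_button, handle_page):
    """Loop over result pages: wait for translation, process rows, click NEXT.

    `find_next_button(driver)` should return the NEXT button element or None
    on the last page; `handle_page(driver)` scrapes and cleans the current
    50-row batch. Both are placeholder names, not the actual run.py helpers.
    """
    while True:
        time.sleep(TRANSLATE_WINDOW_SECONDS)  # manual scroll + translate window
        handle_page(driver)
        next_button = find_next_button(driver)
        if next_button is None:  # no NEXT button means we are on the last page
            return
        next_button.click()
```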
This process also runs a data cleaning pipeline with a batch size of 50 rows (1 page).
Subsequently, it adds the data to the PostgreSQL table active on the local server.
- Endpoint: /column_name
- Method: GET
- Description: Retrieve all scraped data from the database.
- Returns: all records in JSON
Steps:
- Go to the Server folder
- Run the command
python server.py
- Wait for the development server to start.
- Copy the link and paste it into an API testing tool (e.g. Postman)
- Follow the example search strings below
These are just examples of the API endpoints that can be created; SQL queries can be used to search the database to a good extent.
These basic examples were implemented to demonstrate how the API endpoints work with the PostgreSQL table.
Example Requests:
- Request data for a specific year
localhost/2023
- Request address using partial text search
localhost/address?search=mumbai
- Request name using partial text search
localhost/name?search=advocate
- Request document number using partial text search
localhost/doc_no?search=2449
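A partial text search like the ones above can be backed by a parameterised LIKE query. The sketch below runs against SQLite (standard library) for demonstration; the actual server targets PostgreSQL, where ILIKE would give case-insensitive matching. The table name `property_details` and the column whitelist are assumptions, not the real schema.

```python
import sqlite3

def search_column(conn, column, term):
    """Return rows where `column` contains `term` (partial text search).

    The column name is interpolated, so it is checked against a whitelist;
    the search term itself is passed as a bound parameter to avoid SQL
    injection. Column names here are assumed, not the project's real schema.
    """
    allowed = {"address", "name", "doc_no"}  # hypothetical column whitelist
    if column not in allowed:
        raise ValueError(f"unknown column: {column}")
    query = f"SELECT * FROM property_details WHERE {column} LIKE ?"
    return conn.execute(query, (f"%{term}%",)).fetchall()

# demo with an in-memory SQLite table standing in for the Postgres one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE property_details (doc_no TEXT, name TEXT, address TEXT)")
conn.execute("INSERT INTO property_details VALUES ('2449', 'advocate x', 'mumbai west')")
conn.execute("INSERT INTO property_details VALUES ('1001', 'y', 'pune')")

rows = search_column(conn, "address", "mumbai")
```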
Libraries used: pandas, sklearn.base, numpy
Loc: custom_transformers.py
Steps
- Dropped rows containing NaN values
- Dropped the undesired column at list index 2 (the link was already scraped into another column)
- Renamed columns to database conventions separated with an underscore (ex. buyer_name)
- Changed date dtypes from string object to datetime object
- Changed float dtypes to int dtypes
- Separated the latest buyers and sellers from the buyer and seller list columns, creating new columns latest_buyer_name and latest_seller_name
- Renamed old buyer and seller columns to buyer_history and seller_history
- Removed anomalies from doc_type column
All these steps are performed with sklearn.pipeline and sklearn.base, using the Pipeline, TransformerMixin, and BaseEstimator classes.
BaseEstimator has no real use in this case, but it is helpful for imputing estimated values for numerical data.
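Two of the steps above, written as a minimal sketch with TransformerMixin and BaseEstimator. This is not the actual custom_transformers.py code: the column names and the assumption that histories are comma-separated with the latest party first are illustrative only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DropNaN(BaseEstimator, TransformerMixin):
    """Drop any row containing NaN values."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.dropna()

class LatestParty(BaseEstimator, TransformerMixin):
    """Split the latest buyer/seller out of the history columns.

    Assumes comma-separated histories with the most recent party first;
    the real parsing logic in custom_transformers.py may differ.
    """
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X["latest_buyer_name"] = X["buyer_history"].str.split(",").str[0].str.strip()
        X["latest_seller_name"] = X["seller_history"].str.split(",").str[0].str.strip()
        return X

cleaning = Pipeline([("dropna", DropNaN()), ("latest", LatestParty())])
```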
Libraries: pandas, sqlalchemy, psycopg2
Loc: Table/table.py
Steps
- Connect to the database using the DB_PARAMS constants
- Create a SQLAlchemy engine and connect to the database
- Receive the DataFrame in the save_table function by value
- Used the to_sql method with the SQLAlchemy engine to insert values into the database
- If the table exists, the selected option is to append values, so the inputs can keep coming from the webdriver
- Defined the POSTGRES_DTYPES constant in Constants/constants.py and used it to define the datatypes of the database columns
- Done for now, all inputs added
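The save step above might look roughly like this. In the real Table/table.py the engine is built from DB_PARAMS against Postgres; here the engine is passed in so the sketch works with any SQLAlchemy engine, and the table name and dtype mapping are assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine

def save_table(df, engine, table_name="property_details", dtypes=None):
    """Append the cleaned DataFrame batch to the database table.

    if_exists="append" lets batches keep arriving from the webdriver;
    `dtypes` would carry the POSTGRES_DTYPES mapping in the real project.
    """
    df.to_sql(table_name, engine, if_exists="append", index=False, dtype=dtypes)

# demo against an in-memory SQLite engine (the project itself targets Postgres)
engine = create_engine("sqlite://")
batch = pd.DataFrame({"doc_no": [2449], "latest_buyer_name": ["x"]})
save_table(batch, engine)
save_table(batch, engine)  # second call appends rather than replaces
```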
This package demonstrates my ability to scrape a website and clean the data in transit so that it is ready to be stored in the database.
Most of my learnings from this project include:
- BS4 makes scraping structured data easier.
- Automating every step of data ingestion can be challenging.
- I was weak with SQL queries, but through this project I have learnt the capabilities of SQL.
- There is always scope for improvement.
I was advised to use a Docker image to load data into Postgres. I am unfamiliar with Docker's use cases, so I will be exploring Docker next to ensure that the ETL flow meets industry standards.
The captcha step can be automated using OCR to convert the image to text.
For now, the extension can only be enabled by the webdriver; the user still needs to manually translate the page using the extension and scroll.
Using JS injection in web pages may help with scrolling, and with libraries like PyAutoGUI in Python, we can direct the mouse cursor so the page translates itself.
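The scrolling half of that idea could look like the sketch below, using Selenium's execute_script to inject JS. The helper name is hypothetical; `driver` stands for any Selenium WebDriver.

```python
SCROLL_TO_BOTTOM_JS = "window.scrollTo(0, document.body.scrollHeight);"

def scroll_to_bottom(driver):
    """Scroll the page to the bottom so the translate extension sees all rows.

    Works with any object exposing Selenium's execute_script; calling this
    repeatedly with a short sleep in between would handle lazily loaded rows.
    """
    driver.execute_script(SCROLL_TO_BOTTOM_JS)
```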
I will keep exploring further possibilities for automating the scraping.