Design an ingestion pipeline that fetches Mutual Fund data from the AMFI website, models it, and stores it in a database for further processing.
Features:
- Pipeline can do a full fetch and load
- Initial fetch and load into the database takes less than 30 seconds (more than 3,00,000 entries)
- Pipeline crawls the website in parallel using a pool of processes
- Pipeline uses a queue system (RabbitMQ) to pass data along for DB insertion
- Handles failures and logs them to log files
- Secondary indexes created for better query performance
- scheme_name normalised for faster and better queries (see the sketch after this list)
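
As a rough illustration of the last two points, index creation and scheme_name normalisation could look like this with pymongo. This is a sketch only: the database, collection, and field names (amfi, nav_data, scheme_code, date) are assumptions, not necessarily what the repo uses.

```python
# Sketch only: secondary indexes and scheme_name normalisation with pymongo.
# Database, collection, and field names are assumed.
import re

from pymongo import ASCENDING, MongoClient

collection = MongoClient("mongodb://localhost:27017")["amfi"]["nav_data"]

# Secondary indexes for the assumed common query patterns.
collection.create_index([("scheme_name", ASCENDING)])
collection.create_index([("scheme_code", ASCENDING), ("date", ASCENDING)])

def normalise_scheme_name(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so lookups match reliably."""
    return re.sub(r"\s+", " ", name.strip().lower())
```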
How it works:
- If --initial-fetch=True, a pool of processes is created first, and each process crawls the website for its own date range. If --initial-fetch=False, no pool is created.
- The fetched data is then cleaned up.
- The cleaned-up data is published to a RabbitMQ queue.
- receiver.py actively consumes the queue and inserts the data into MongoDB as it arrives. (Sketches of the producer and consumer sides follow below.)
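
A minimal sketch of the producer side of this flow, assuming pika for RabbitMQ. Here fetch_nav_rows() is a hypothetical stand-in for the actual AMFI crawling and clean-up code, and the queue name "nav_data" and the date ranges are placeholders:

```python
# Sketch of the producer side: a pool of processes crawls date ranges in
# parallel and publishes cleaned rows to a RabbitMQ queue.
import argparse
import json
from datetime import date, timedelta
from multiprocessing import Pool

import pika

def fetch_nav_rows(date_range):
    """Hypothetical: download and clean NAV rows for one date range."""
    return [{"scheme_name": "example fund", "nav": 10.0, "range": date_range}]

def crawl_and_publish(date_range):
    # Each worker opens its own connection: pika connections must not be
    # shared across processes.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="nav_data", durable=True)
    for row in fetch_nav_rows(date_range):
        channel.basic_publish(exchange="", routing_key="nav_data",
                              body=json.dumps(row))
    connection.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--initial-fetch", default="False")
    args = parser.parse_args()

    if args.initial_fetch == "True":
        # Assumed: split the full history into yearly chunks, one per task.
        years = range(2015, date.today().year + 1)
        date_ranges = [(f"{y}-04-01", f"{y + 1}-03-31") for y in years]
        with Pool(processes=4) as pool:
            pool.map(crawl_and_publish, date_ranges)
    else:
        # Incremental fetch: yesterday to today only, no pool.
        today = date.today()
        crawl_and_publish((str(today - timedelta(days=1)), str(today)))
```

Giving each worker its own connection matters because pika connections are not safe to share across processes.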
Tech stack:
- MongoDB
- RabbitMQ
- Multiprocessing
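
Putting that stack together, the consume-and-insert loop in receiver.py could look roughly like this sketch (queue, database, and collection names are again assumptions):

```python
# Sketch of the consumer side: read messages off the RabbitMQ queue and
# insert them into MongoDB. Queue, database, and collection names are assumed.
import json

import pika
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["amfi"]["nav_data"]

def on_message(channel, method, properties, body):
    collection.insert_one(json.loads(body))
    channel.basic_ack(delivery_tag=method.delivery_tag)  # ack only after insert

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="nav_data", durable=True)
channel.basic_qos(prefetch_count=1)  # one unacked message per consumer at a time
channel.basic_consume(queue="nav_data", on_message_callback=on_message)
channel.start_consuming()
```

Acking only after the insert means a message leaves the queue once it has actually reached MongoDB, so a crashed consumer does not lose data.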
Setup:
- Download the RabbitMQ Docker image and start the container:
  docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3.9-management
- Install the dependencies:
  pip install -r requirements.txt
- Start the consumer:
  python receiver.py
- Start the pipeline with an initial full fetch:
  python main.py --initial-fetch=True
Future improvements:
- Dockerize the application to create a deployable solution
- Transform the data for better readability and analysis
- Scripts to automate running the Python scripts
- Better error handling, plus a script to read and analyse the log files