Design an ingestion pipeline that fetches Mutual Fund data from the AMFI website, models it, and stores it in a database for further processing.
Features:
- Pipeline can do a full fetch and load
- Initial fetch and load into the database takes less than 30 seconds (more than 3,00,000 entries)
- Pipeline crawls the website in parallel using a pool of processes
- Pipeline uses a queue system (RabbitMQ) to pass data along for DB insertion
- Handles failures and logs them to log files
- Secondary indexes created for better query performance
- scheme_name normalised for faster and better queries (see the sketch after this list)
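
As a rough illustration of the last two points, index creation and scheme_name normalisation could look like this with pymongo. This is a sketch only: the database, collection, and field names (amfi, nav_data, scheme_code, date) are assumptions, not necessarily what the repo uses.

```python
# Sketch only: secondary indexes and scheme_name normalisation with pymongo.
# Database, collection, and field names are assumed.
import re

from pymongo import ASCENDING, MongoClient

collection = MongoClient("mongodb://localhost:27017")["amfi"]["nav_data"]

# Secondary indexes for the assumed common query patterns.
collection.create_index([("scheme_name", ASCENDING)])
collection.create_index([("scheme_code", ASCENDING), ("date", ASCENDING)])

def normalise_scheme_name(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so lookups match reliably."""
    return re.sub(r"\s+", " ", name.strip().lower())
```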
How it works:
- If --initial-fetch=True, a pool of processes is created first, and each process crawls the website for its own date range. If --initial-fetch=False, no pool is created.
- The fetched data is then cleaned up.
- The cleaned-up data is published to a RabbitMQ queue.
- receiver.py actively consumes the queue and inserts the data into MongoDB as it arrives. (Sketches of the producer and consumer sides follow below.)
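
A minimal sketch of the producer side of this flow, assuming pika for RabbitMQ. Here fetch_nav_rows() is a hypothetical stand-in for the actual AMFI crawling and clean-up code, and the queue name "nav_data" and the date ranges are placeholders:

```python
# Sketch of the producer side: a pool of processes crawls date ranges in
# parallel and publishes cleaned rows to a RabbitMQ queue.
import argparse
import json
from datetime import date, timedelta
from multiprocessing import Pool

import pika

def fetch_nav_rows(date_range):
    """Hypothetical: download and clean NAV rows for one date range."""
    return [{"scheme_name": "example fund", "nav": 10.0, "range": date_range}]

def crawl_and_publish(date_range):
    # Each worker opens its own connection: pika connections must not be
    # shared across processes.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="nav_data", durable=True)
    for row in fetch_nav_rows(date_range):
        channel.basic_publish(exchange="", routing_key="nav_data",
                              body=json.dumps(row))
    connection.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--initial-fetch", default="False")
    args = parser.parse_args()

    if args.initial_fetch == "True":
        # Assumed: split the full history into yearly chunks, one per task.
        years = range(2015, date.today().year + 1)
        date_ranges = [(f"{y}-04-01", f"{y + 1}-03-31") for y in years]
        with Pool(processes=4) as pool:
            pool.map(crawl_and_publish, date_ranges)
    else:
        # Incremental fetch: yesterday to today only, no pool.
        today = date.today()
        crawl_and_publish((str(today - timedelta(days=1)), str(today)))
```

Giving each worker its own connection matters because pika connections are not safe to share across processes.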
Tech stack:
- MongoDB
- RabbitMQ
- Multiprocessing
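
Putting that stack together, the consume-and-insert loop in receiver.py could look roughly like this sketch (queue, database, and collection names are again assumptions):

```python
# Sketch of the consumer side: read messages off the RabbitMQ queue and
# insert them into MongoDB. Queue, database, and collection names are assumed.
import json

import pika
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["amfi"]["nav_data"]

def on_message(channel, method, properties, body):
    collection.insert_one(json.loads(body))
    channel.basic_ack(delivery_tag=method.delivery_tag)  # ack only after insert

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="nav_data", durable=True)
channel.basic_qos(prefetch_count=1)  # one unacked message per consumer at a time
channel.basic_consume(queue="nav_data", on_message_callback=on_message)
channel.start_consuming()
```

Acking only after the insert means a message leaves the queue once it has actually reached MongoDB, so a crashed consumer does not lose data.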
Setup:
- Download the RabbitMQ Docker image and start the container:
  docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3.9-management
- Install the dependencies:
  pip install -r requirements.txt
- Start the consumer:
  python receiver.py
- Start the pipeline with an initial full fetch:
  python main.py --initial-fetch=True
Future improvements:
- Dockerize the application to create a deployable solution
- Transform the data for better readability and analysis
- Scripts to automate running the Python scripts
- Better error handling, plus a script to read and analyse the log files