- The story starts with web scraping Yahoo Finance stock data on an AWS EC2 instance using a Python script; the script is wrapped in try/except error handling so it can run continuously and scrape data in real time.
- Data is pushed to Kinesis Data Streams using the AWS SDK for Python (boto3).
- Data is ingested by Kinesis Data Streams and pushed to two destinations: AWS Lambda and Kinesis Firehose.
- Firehose delivers the data to two S3 buckets; one serves as a backup.
- Athena can then be used to query the data from S3.
- QuickSight is connected to Athena to visualize the data.
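The Athena step above can be sketched with boto3. The database and table names ("stocks_db", "stocks") and the results bucket are assumptions for illustration, not taken from the original write-up:

```python
# Hedged sketch: running an Athena query over the processed S3 data.
import time


def build_daily_avg_query(table="stocks"):
    """SQL Athena would run over the Parquet files produced by Firehose."""
    return (
        f"SELECT symbol, avg(price) AS avg_price "
        f"FROM {table} GROUP BY symbol ORDER BY symbol"
    )


def run_athena_query(sql, database="stocks_db",
                     output="s3://my-athena-results/"):  # hypothetical bucket
    import boto3  # deferred so the query builder is testable offline
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    # Poll until the query finishes, then fetch the result rows.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    return athena.get_query_results(QueryExecutionId=qid)
```

QuickSight runs equivalent queries itself once its dataset points at the Athena table, so this client is only needed for ad-hoc checks.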
- Data is extracted from Yahoo Finance asynchronously in JSON format using a Python script, then sent to Kinesis Data Streams using the AWS SDK for Python (boto3).
- The data is cleaned on the EC2 instance by the same Python script right after scraping.
- Data is pushed into Kinesis Data Streams, which fans it out to two destinations:
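The scrape-clean-push loop can be sketched as below. The stream name, ticker list, and field names are assumptions for illustration; `fetch_quote` stands in for whatever scraping function the script uses:

```python
# Hedged sketch of the EC2 scraper loop described above.
import json
import time
from datetime import datetime, timezone

TICKERS = ["AAPL", "MSFT", "GOOG"]  # hypothetical watchlist


def clean_quote(raw):
    """Keep only the fields we need and normalize their types."""
    return {
        "symbol": raw["symbol"],
        "price": float(raw["price"]),
        "volume": int(raw.get("volume", 0)),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


def push_to_kinesis(record, stream_name="stocks-stream"):  # assumed name
    """Send one cleaned record to Kinesis Data Streams via boto3."""
    import boto3  # deferred so the cleaning logic is testable offline
    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["symbol"],  # spreads symbols across shards
    )


def scrape_forever(fetch_quote, interval=5):
    """Run continuously; errors are caught so scraping never stops."""
    while True:
        for symbol in TICKERS:
            try:
                push_to_kinesis(clean_quote(fetch_quote(symbol)))
            except Exception as exc:
                print(f"skipping {symbol}: {exc}")
        time.sleep(interval)
```

Partitioning by symbol keeps each ticker's records ordered within a shard while still letting the stream scale across shards.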
- AWS Lambda:
- When the data reaches Lambda, the function prepares it so it can be sent to InfluxDB.
- Grafana is used to visualize the data in real time by pulling it from InfluxDB.
- Kinesis Firehose:
- Before processing, Firehose batches the raw, unprocessed data and dumps it into a backup bucket.
- Firehose processes the data, transforming it from JSON to Parquet with the help of the Glue Data Catalog.
- The processed data is batched and stored in a separate bucket.
- The transformation works by defining a table schema in AWS Glue; Firehose then uses that table to convert the data to Parquet.
- Further transformations would require a Lambda function, but since this is a simple conversion, doing it inside Firehose keeps the infrastructure much simpler.
- The data is transformed to Parquet so that Athena can query it efficiently.
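The Glue table Firehose relies on can be sketched as below. The database, table, bucket, and column names are assumptions matching the scraper fields, not the author's exact schema:

```python
# Hedged sketch of the Glue Data Catalog table used for JSON -> Parquet
# conversion in Firehose.
def stock_table_input(location="s3://stocks-processed/"):  # hypothetical bucket
    """TableInput for glue.create_table; Firehose reads this schema."""
    return {
        "Name": "stocks",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "symbol", "Type": "string"},
                {"Name": "price", "Type": "double"},
                {"Name": "volume", "Type": "bigint"},
                {"Name": "scraped_at", "Type": "timestamp"},
            ],
            "Location": location,
        },
        "TableType": "EXTERNAL_TABLE",
    }


def create_stock_table(database="stocks_db"):  # assumed database name
    import boto3  # deferred so the schema helper is testable offline
    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=stock_table_input())
```

Because both Firehose and Athena read the same catalog entry, the schema only has to be defined once.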
- AWS Lambda:
- Data is visualized in real time: records are sent to Lambda, which prepares them and writes them to InfluxDB.
- Grafana pulls the data from InfluxDB and uses it for the visualizations.
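The Lambda side above can be sketched as below. The "stock_price" measurement, bucket, and org names are assumptions; the InfluxDB write uses the `influxdb-client` library, which may differ from the author's setup:

```python
# Hedged sketch of the Lambda function that feeds InfluxDB.
import base64
import json


def to_line_protocol(record):
    """Convert one scraped quote into an InfluxDB line-protocol point."""
    return (
        f"stock_price,symbol={record['symbol']} "
        f"price={record['price']},volume={record['volume']}i"
    )


def lambda_handler(event, context):
    """Kinesis delivers base64-encoded records in event['Records']."""
    lines = []
    for rec in event["Records"]:
        payload = base64.b64decode(rec["kinesis"]["data"])
        lines.append(to_line_protocol(json.loads(payload)))
    write_to_influx("\n".join(lines))
    return {"points": len(lines)}


def write_to_influx(body, bucket="stocks", org="my-org"):  # assumed names
    # Deferred import so the line-protocol logic is testable offline.
    from influxdb_client import InfluxDBClient
    from influxdb_client.client.write_api import SYNCHRONOUS
    with InfluxDBClient.from_env_properties() as client:
        client.write_api(write_options=SYNCHRONOUS).write(
            bucket=bucket, org=org, record=body)
```

A synchronous write is used here because a Lambda invocation should not return before its points are flushed.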