AhmedZiada-DE/YahooFinanceStocksETL

Web scraping Yahoo Finance then pushing the data to Kinesis Data Streams

Yahoo Finance Stocks ETL pipeline

Overview

The pipeline starts by scraping Yahoo Finance stock data on an AWS EC2 instance. The Python script wraps the scraping in try/except error handling and runs continuously, so that a failed request does not stop it and data keeps being scraped in real time.
The script pushes the data to Kinesis Data Streams using the AWS SDK for Python.
Kinesis Data Streams ingests the data and delivers it to two destinations: AWS Lambda and Kinesis Firehose.
Firehose sends the data to two S3 buckets, one of which is used for backup.
Athena can then be used to query the data from S3.
QuickSight is connected to Athena to visualize the data.
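
The "runs continuously" behaviour described above can be sketched as a loop that catches errors per iteration instead of letting them end the process. This is a minimal illustration, not the repository's actual script: `run_forever`, `scrape`, `publish`, `poll_seconds` and `max_iterations` are all hypothetical names introduced here.

```python
import time
from typing import Callable, Optional


def run_forever(
    scrape: Callable[[], dict],
    publish: Callable[[dict], None],
    poll_seconds: float = 1.0,
    max_iterations: Optional[int] = None,
) -> int:
    """Scrape and publish in a loop, surviving per-iteration failures.

    `scrape` and `publish` are hypothetical callables supplied by the caller.
    Returns the number of records that were published successfully.
    """
    published = 0
    iteration = 0
    # max_iterations=None means "loop until the process is stopped".
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            publish(scrape())
            published += 1
        except Exception as exc:
            # Log the failure and carry on with the next iteration.
            print(f"iteration {iteration} failed: {exc!r}")
        time.sleep(poll_seconds)
    return published
```

Passing `max_iterations` is only useful for testing; in a long-running deployment it would be left as `None`.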

Realtime ETL Pipeline

Extracting the data

Data is extracted from Yahoo Finance asynchronously in JSON format by a Python script, which then sends it to Kinesis Data Streams using the AWS SDK for Python.
The same script cleans the data on the EC2 instance right after scraping it.
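
As a rough sketch of the clean-then-send step, the snippet below normalizes one scraped quote and writes it to a stream with the `put_record` call of the boto3 Kinesis client. The record fields (`symbol`, `price`, `scraped_at`), the helper names and the stream name are assumptions made for illustration; the real script may use a different shape.

```python
import json


def clean_quote(raw: dict) -> dict:
    """Normalize a scraped quote. The field names are hypothetical."""
    return {
        "symbol": raw["symbol"].strip().upper(),
        # Scraped prices may arrive as strings such as "1,234.50".
        "price": float(str(raw["price"]).replace(",", "")),
        "scraped_at": raw["scraped_at"],
    }


def send_quote(kinesis_client, stream_name: str, quote: dict) -> dict:
    """Send one cleaned quote to Kinesis Data Streams as a JSON record."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(quote).encode("utf-8"),
        # Partitioning by symbol keeps each ticker's records on one shard.
        PartitionKey=quote["symbol"],
    )


# Usage (requires boto3 and AWS credentials; the stream name is made up):
#   import boto3
#   client = boto3.client("kinesis")
#   send_quote(client, "stocks-stream", clean_quote(scraped_row))
```

Taking the client as a parameter keeps the functions testable without AWS access.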

Loading and Transforming the data

Data is pushed into Kinesis Data Streams which sends the data into two destinations:
- AWS Lambda:
  - Lambda prepares each incoming record so that it can be sent to InfluxDB.
  - Grafana pulls the data from InfluxDB to visualize it in real time.
- Kinesis Firehose:
  - Before processing the data, Firehose batches the unprocessed records and writes them to a backup bucket.
  - Firehose then converts the data from JSON to Parquet with the help of the Glue Data Catalog.
  - The processed data is batched and stored in a separate bucket.
- The transformation is driven by a table schema defined in AWS Glue, which Firehose uses to convert the data to Parquet.
  - Any further transformation would require a Lambda function, but since this is a simple format conversion, doing it inside Firehose keeps the infrastructure much simpler.
- The data is converted to Parquet so that Athena can query it efficiently.
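
The Firehose setup described in the list above (raw backup bucket, Glue-backed JSON to Parquet conversion, separate processed bucket) could be expressed roughly as the destination configuration below. This is a sketch of the settings involved, not the configuration used in this project: every ARN, database name and table name is a placeholder supplied by the caller, and the buffering values are assumptions.

```python
def parquet_destination_config(
    role_arn: str,
    processed_bucket_arn: str,
    backup_bucket_arn: str,
    glue_database: str,
    glue_table: str,
    region: str,
) -> dict:
    """Build an ExtendedS3DestinationConfiguration dict for Firehose.

    All arguments are placeholders; nothing here is taken from the repo.
    """
    return {
        "RoleARN": role_arn,
        "BucketARN": processed_bucket_arn,
        # Assumed values; format conversion expects a larger buffer size.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        # Keep the unprocessed source records in the backup bucket.
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": backup_bucket_arn,
        },
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # The Glue table supplies the schema for the Parquet output.
            "SchemaConfiguration": {
                "RoleARN": role_arn,
                "DatabaseName": glue_database,
                "TableName": glue_table,
                "Region": region,
                "VersionId": "LATEST",
            },
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
        },
    }
```

With boto3, a dict like this would be passed as `ExtendedS3DestinationConfiguration` to the Firehose client's `create_delivery_stream` call, alongside a `KinesisStreamSourceConfiguration` pointing at the data stream.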

Visualization

Realtime Visualization with Grafana

For realtime visualization, the data is sent to Lambda, which prepares it and sends it to InfluxDB.
Grafana pulls the data from InfluxDB and uses it for the visualizations.
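
One way the "prepare" step in Lambda might look is shown below: the handler decodes the base64 payload of each Kinesis record and turns it into an InfluxDB line protocol string. This is only a sketch under assumptions. The record fields (`symbol`, `price`, `scraped_at` as epoch seconds) and the measurement name `stock_price` are hypothetical, and the actual write to InfluxDB is left out.

```python
import base64
import json


def to_line_protocol(quote: dict, measurement: str = "stock_price") -> str:
    """Format one quote as InfluxDB line protocol. Field names are assumed."""
    # Commas, spaces and equals signs must be escaped in tag values.
    symbol = (
        str(quote["symbol"])
        .replace(",", "\\,")
        .replace(" ", "\\ ")
        .replace("=", "\\=")
    )
    # Assumes scraped_at is epoch seconds; line protocol defaults to ns.
    timestamp_ns = int(quote["scraped_at"]) * 1_000_000_000
    return f"{measurement},symbol={symbol} price={float(quote['price'])} {timestamp_ns}"


def lambda_handler(event, context):
    """Decode Kinesis records and prepare them for InfluxDB."""
    lines = []
    for record in event["Records"]:
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        lines.append(to_line_protocol(json.loads(payload)))
    # A real handler would write `lines` to InfluxDB here.
    return {"lines": lines}
```

Keeping the formatting in its own function makes it possible to test the conversion without deploying the Lambda.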

AWS QuickSight

The Parquet files in S3 are queried by Athena and visualized with QuickSight.
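
For reference, a query over the Parquet data can also be started from Python through the boto3 Athena client's `start_query_execution` call. The sketch below is illustrative only: the table columns (`symbol`, `price`), the database and table names, and the S3 output location are assumptions, not values from this project.

```python
def start_average_price_query(
    athena_client,
    database: str,
    table: str,
    output_location: str,
    limit: int = 10,
) -> str:
    """Start an Athena query and return its execution id.

    The column names `symbol` and `price` are hypothetical.
    """
    query = (
        f'SELECT symbol, avg(price) AS avg_price FROM "{table}" '
        f"GROUP BY symbol ORDER BY avg_price DESC LIMIT {int(limit)}"
    )
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Usage (requires boto3 and AWS credentials; all names are made up):
#   import boto3
#   athena = boto3.client("athena")
#   start_average_price_query(athena, "stocks_db", "quotes", "s3://my-athena-results/")
```

The call only starts the query; results would then be fetched with `get_query_execution` and `get_query_results` once it finishes.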