This project demonstrates how to extract public data from around the world, feed it into data pipelines, and perform ETL/ELT, as shown in the diagram in each section. The stack includes:
- AWS
- Terraform
- S3
- Apache Kafka
- Apache Airflow
- Apache Spark
- Amazon Redshift
- Athena/Presto
- PostgreSQL
- Elasticsearch
- Docker
According to the GDELT website, the GDELT dataset is one of the largest and most ambitious platforms ever created for monitoring our global world: from real-time translation of the world's news in 65 languages, to measurement of more than 2,300 emotions and themes in every article, to a massive inventory of the media of the non-Western world.
Many organizations use GDELT as a complementary dataset to add new signals to their machine learning models, for example for stock price prediction or for predicting community engagement.
In this project, I've downloaded the 15-minute "events" and "mentions" updates directly from GDELT, and scheduled scripts to download them automatically every 15 minutes before uploading to an S3 data lake, so that Redshift Spectrum and Athena can query on top of S3.
You can see the actual code and read more information at:
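As a rough sketch of that 15-minute download step: GDELT publishes a `lastupdate.txt` pointer to its latest v2 files, while the S3 key layout below is my own illustrative choice, not something mandated by the project.

```python
LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"


def parse_lastupdate(text: str) -> dict:
    """Parse GDELT's lastupdate.txt, where each line is '<size> <md5> <url>'.

    Returns the URLs of the latest 15-minute events and mentions files.
    """
    urls = {}
    for line in text.strip().splitlines():
        url = line.split()[-1]
        if url.endswith(".export.CSV.zip"):
            urls["events"] = url
        elif url.endswith(".mentions.CSV.zip"):
            urls["mentions"] = url
    return urls


def s3_key_for(url: str, prefix: str = "gdelt") -> str:
    """Build a date-partitioned S3 key such as gdelt/events/2024/05/01/<file>.

    GDELT v2 file names start with a YYYYMMDDHHMMSS timestamp, which is reused
    here as the partition path so Athena/Spectrum can prune by date.
    """
    name = url.rsplit("/", 1)[-1]
    ts = name.split(".")[0]
    kind = "events" if ".export." in name else "mentions"
    return f"{prefix}/{kind}/{ts[:4]}/{ts[4:6]}/{ts[6:8]}/{name}"
```

A scheduler firing every 15 minutes would fetch `LASTUPDATE_URL`, pass the body to `parse_lastupdate`, download each file, and upload it to the bucket (e.g. with boto3's `put_object`) under the key from `s3_key_for`.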
In addition, AWS uploads GDELT events to its Registry of Open Data on S3 every day, so we don't have to write scraping scripts to download the historical GDELT events ourselves. I decided to use Apache Airflow as the glue between my custom code and the AWS services.
You can see the actual code and read more information at:
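A minimal sketch of the backfill side, assuming the public mirror stores one events file per day; the bucket name and key layout below are assumptions and should be verified against the registry entry:

```python
from datetime import date

# Assumed name of the public GDELT bucket in AWS's Registry of Open Data.
OPEN_DATA_BUCKET = "gdelt-open-data"


def daily_events_key(day: date) -> str:
    """Key of one day's historical events file in the assumed public mirror."""
    return f"events/{day:%Y%m%d}.export.csv"


# In Airflow (v2-style API), a daily DAG could copy each day's object into
# our own data lake, roughly like:
#
#   with DAG("gdelt_backfill", schedule="@daily", ...) as dag:
#       PythonOperator(
#           task_id="copy_day",
#           python_callable=lambda ds, **_: copy_to_lake(ds),  # hypothetical helper
#       )
```

Because each run handles exactly one logical date, Airflow's catchup mechanism can replay the whole history by simply setting an early `start_date`.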
Yahoo used to provide the Yahoo Finance API for getting market data. Unfortunately, it has been deprecated and can no longer be accessed, so I wrote a small script that scrapes Yahoo Finance and ingests the data into PostgreSQL.
You can see the actual code and read more information at:
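A sketch of the ingestion side, assuming the scraper yields one dict of display strings per trading day; the field names, table, and column names here are hypothetical, not the project's actual schema:

```python
from datetime import datetime


def normalize_row(symbol: str, raw: dict) -> tuple:
    """Turn one scraped price row (strings as shown on the page) into a
    typed tuple ready for insertion. The raw field names are assumptions."""
    return (
        symbol,
        datetime.strptime(raw["date"], "%b %d, %Y").date(),
        float(raw["close"].replace(",", "")),
        int(raw["volume"].replace(",", "")),
    )


# Upsert keeps re-scraping idempotent: re-running a day overwrites, not duplicates.
UPSERT_SQL = """
    INSERT INTO daily_prices (symbol, trade_date, close, volume)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (symbol, trade_date) DO UPDATE
    SET close = EXCLUDED.close, volume = EXCLUDED.volume
"""

# With psycopg2:
#   cur.executemany(UPSERT_SQL, [normalize_row("AAPL", r) for r in rows])
```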
A read replica is not required, but you can create one for better read performance with `terraform apply -target module.yahoo_db_replica`.
The Twitter API platform offers a way to stream tweets in real time, which is nice because we can now capture public sentiment as it happens.
You can see the actual code and read more information at:
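A toy sketch of the consuming side, assuming each streamed line is one tweet as a JSON object; the word lists stand in for a real sentiment model, and authentication/connection code is omitted:

```python
import json

# Toy placeholder lexicons -- a real pipeline would use a proper sentiment model.
POSITIVE = {"great", "up", "bullish", "good"}
NEGATIVE = {"bad", "down", "bearish", "crash"}

def score_tweet(line: str) -> tuple:
    """Parse one streamed JSON line and return (text, naive sentiment score).

    The score is simply (# positive words) - (# negative words).
    """
    tweet = json.loads(line)
    words = [w.strip(".,!?") for w in tweet.get("text", "").lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return tweet.get("text", ""), score
```

Each scored tweet could then be written to Kafka or Elasticsearch alongside the GDELT signals.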