/gdelt

A project that extracts public data (for example, from Twitter, GDELT, and Yahoo Finance), feeds it into data pipelines, and performs ETL/ELT on the AWS cloud using techniques such as batch and stream processing.


GDELT, Yahoo and Twitter stream

This is a demonstration of how to extract public data from around the world, feed it into data pipelines, and perform ETL/ELT, as shown in the diagram for each section.

Environments

  • AWS
  • Terraform
  • S3
  • Apache Kafka
  • Apache Airflow
  • Apache Spark
  • Amazon Redshift
  • Athena/Presto
  • PostgreSQL
  • ElasticSearch
  • Docker

Set up environments

GDELT

According to the GDELT website, the GDELT dataset is one of the largest and most ambitious platforms ever created for monitoring our global world. It spans realtime translation of the world’s news in 65 languages, measurement of more than 2,300 emotions and themes from every article, and a massive inventory of the media of the non-Western world.


Many organizations use GDELT as a complementary dataset to add new signals to their machine learning models, for example for stock price prediction and for predicting community engagement.

In this project, I've downloaded both the "events" and "mentions" 15-minute updates directly from GDELT. I've also scheduled scripts that download each new update automatically every 15 minutes and upload it to the S3 data lake, so that Redshift Spectrum and Athena can query the data directly on S3.
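As a rough illustration of the 15-minute download step, the sketch below rounds a timestamp down to the most recent quarter hour and builds the matching "events" and "mentions" file URLs plus a destination key in S3. It is not the project's actual script. The base URL and the `.export.CSV.zip` / `.mentions.CSV.zip` naming are assumptions about the GDELT 2.0 file layout that should be checked against the GDELT documentation, and the `gdelt/<table>/date=...` key layout is purely hypothetical.

```python
from datetime import datetime

# Assumed base URL for the GDELT 2.0 15-minute update files (verify before use).
GDELT_BASE_URL = "http://data.gdeltproject.org/gdeltv2"


def floor_to_quarter_hour(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def update_file_urls(ts: datetime) -> dict:
    """Build the assumed URLs of the events and mentions files for ts's slot."""
    stamp = floor_to_quarter_hour(ts).strftime("%Y%m%d%H%M%S")
    return {
        "events": f"{GDELT_BASE_URL}/{stamp}.export.CSV.zip",
        "mentions": f"{GDELT_BASE_URL}/{stamp}.mentions.CSV.zip",
    }


def s3_key(table: str, ts: datetime) -> str:
    """Hypothetical S3 key layout, partitioned by date so query engines can prune."""
    slot = floor_to_quarter_hour(ts)
    return f"gdelt/{table}/date={slot:%Y-%m-%d}/{slot:%Y%m%d%H%M%S}.CSV.zip"
```

A scheduled job could call `update_file_urls` with the current UTC time, download each file, and upload it under `s3_key(...)`. Partitioning keys by date is one common choice for data queried through Athena or Redshift Spectrum, not something the project is confirmed to do.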

You can see the actual code and read more information at:

In addition, AWS uploads GDELT events to the AWS S3 registry every day, so we don't have to write our own scraping scripts to download the historical GDELT events. I decided to use Apache Airflow as the glue between my custom code and AWS services.

You can see the actual code and read more information at:

Yahoo finance

Yahoo used to provide the Yahoo Finance API for retrieving market data. Unfortunately, it has been deprecated and can no longer be accessed. So, I decided to write a small script that scrapes Yahoo Finance and ingests the data into PostgreSQL.
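To make the scrape-then-ingest step more concrete, here is a minimal sketch of the part between scraping and the database: turning a scraped row of strings into typed values ready for a parameterized insert. The table name `daily_prices`, its columns, and the shape of the scraped row are all hypothetical, and the `%s` placeholders assume a psycopg2-style PostgreSQL driver. The project's real script may look quite different.

```python
from datetime import date
from decimal import Decimal

# Hypothetical target table and columns; %s placeholders assume a psycopg2-style driver.
INSERT_SQL = (
    "INSERT INTO daily_prices (symbol, trade_date, close, volume) "
    "VALUES (%s, %s, %s, %s)"
)


def parse_row(symbol: str, raw: dict) -> tuple:
    """Convert one scraped row (all values are strings) into typed insert parameters."""
    return (
        symbol.upper(),
        date.fromisoformat(raw["date"]),
        # Scraped numbers often carry thousands separators, so strip them first.
        Decimal(raw["close"].replace(",", "")),
        int(raw["volume"].replace(",", "")),
    )
```

With a DB-API cursor, the parsed rows could then be written with `cursor.executemany(INSERT_SQL, rows)`. Using `Decimal` rather than `float` avoids rounding surprises for prices.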


You can see the actual code and read more information at:

A read replica is not required. If you want better performance, you can create one with terraform apply -target module.yahoo_db_replica.

Twitter real time stream

The Twitter API platform offers a way to stream tweets in real time, which lets us capture people's sentiment as it happens.

You can see the actual code and read more information at: