
Data Integration Project


Garudata

Garudata is a simplified showcase of various tools coming together to build an end-to-end data platform.

It is designed to streamline data ingestion, transformation, access, and sharing, allowing data users to easily understand data throughout its journey.

Demo

Visit https://71182141.xyz/ to see how the workflow management and dashboard work.

Technology

The data platform will be built on top of the following:

All tools (except for Nginx) will be deployed in containers. Host OS is Ubuntu Server 22.04.

Notice of change

Apache Superset does not appear to support non-aggregated values in metrics (#5570, #19182). Because the weather data project needs this feature, the business intelligence tool will need to be replaced.

I am currently exploring Metabase as the replacement tool.

Usage

Requirements

  1. Install Docker and Compose
  2. Set up a Docker network that the containers share to reach one another. This project uses a bridge network named garudanet on the subnet 10.10.17.0/24:
    docker network create -d bridge --subnet 10.10.17.0/24 --gateway 10.10.17.1 garudanet
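
Running the command above a second time fails because the network already exists. A minimal sketch of an idempotent variant follows. The function name ensure_network and the variable names are illustrative and not part of the project; the network name, subnet, and gateway are the ones given above.

```shell
#!/bin/sh
# Sketch: create the garudanet bridge network only if it is missing.
# ensure_network and the variable names are assumptions for illustration.
NETWORK_NAME="garudanet"
SUBNET="10.10.17.0/24"
GATEWAY="10.10.17.1"

ensure_network() {
    # "docker network inspect" exits non-zero when the network does not exist.
    if docker network inspect "$NETWORK_NAME" >/dev/null 2>&1; then
        echo "network $NETWORK_NAME already exists"
    else
        docker network create -d bridge \
            --subnet "$SUBNET" --gateway "$GATEWAY" "$NETWORK_NAME"
    fi
}
```

Calling ensure_network after sourcing the snippet creates the network on the first run and only reports its presence afterwards. A container can then join it with, for example, docker run --network garudanet.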
    

Roadmap

The list is not exhaustive and may change along the way:

  • Design end-to-end data platform architecture
  • Set up the server and the components
  • Set up Apache Spark
  • Develop a data journey use case (Note: Refer to Merpati project)
  • Design data model (Note: Refer to Merpati project)
  • Develop data extraction script (Note: Refer to Merpati project)
  • Deploy workflow using Airflow (Note: Refer to Jalak project)
  • Design simple dashboards
  • Manage metadata
  • Other improvements along the way

License

The data platform is a self-learning project, shared under the MIT License.

All included applications follow their respective licenses.