Garudata is a simplified showcase of various tools coming together to build an end-to-end data platform.
It is designed to streamline data ingestion, transformation, access, and sharing, allowing data users to easily understand data throughout its journey.
Visit https://71182141.xyz/ to see how the workflow management and dashboard work.
The data platform will be built on top of the following:
All tools (except for Nginx) will be deployed in containers. Host OS is Ubuntu Server 22.04.
It seems that Apache Superset does not support non-aggregated values in metrics (#5570, #19182). Because the weather data project needs this feature, the business intelligence tool will likely need to be replaced.
I am currently exploring Metabase as the replacement tool.
- Install Docker and Compose
- Set up a Docker network that the containers share so they can reach one another. In this project, `garudanet` in `10.10.17.0/24` is used: `docker network create -d bridge --subnet 10.10.17.0/24 --gateway 10.10.17.1 garudanet`
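
The network setup step above fails if it is run a second time, because the network already exists. A minimal sketch of a re-runnable variant follows. The function name `ensure_network` is hypothetical and not part of this project. It assumes only the standard `docker network inspect` and `docker network create` subcommands.

```shell
#!/usr/bin/env bash
# Sketch only: create the shared bridge network if it is not there yet.
# ensure_network is a hypothetical helper name, not something shipped with Garudata.
# Arguments: <network name> <subnet in CIDR notation> <gateway address>
ensure_network() {
  local name="$1" subnet="$2" gateway="$3"

  # `docker network inspect` exits non-zero when the network does not exist.
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    # Same command as in the setup step, with the values passed in as arguments.
    docker network create -d bridge --subnet "$subnet" --gateway "$gateway" "$name"
  fi
}
```

With the values used in this project, the call would be `ensure_network garudanet 10.10.17.0/24 10.10.17.1`. A container can then join the network at start time with `docker run --network garudanet ...`.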
The list is not exhaustive and may change along the way:
- Design end-to-end data platform architecture
- Set up the server and the components
- Set up Apache Spark
- Develop a data journey use case (Note: Refer to Merpati project)
- Design data model (Note: Refer to Merpati project)
- Develop data extraction script (Note: Refer to Merpati project)
- Deploy workflow using Airflow (Note: Refer to Jalak project)
- Design simple dashboards
- Manage metadata
- Other improvements along the way
The data platform is a self-learning project, shared under the MIT License.
All included applications follow their respective licenses.