This repository provides a Docker Compose-based implementation of the Data Platform consumption layer as described in the article Data Platform - Solution for Consumption (Part 2).
The consumption layer includes components for data persistence, storage, cataloging, and processing. Together, these components address the challenges faced by traditional commercial Data Warehouse solutions, such as limited scalability and a tightly coupled architecture, as described in the article Data Platform — The Challenges (Part 1).
The architecture of this Data Platform consumption layer consists of the following components:
- Permanent Memory: Object Storage (S3)
- Storage Engine: Parquet columnar format
- Catalog Manager: Hive Metastore
- Computation Engine: PrestoDB and Hive
These components are decoupled and highly scalable, providing an efficient and flexible solution for various data consumption needs.
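As an illustration of this decoupling, the computation engine only needs configuration pointing at the catalog and storage layers. A Presto Hive-connector catalog file might look like the sketch below; the property names are standard Presto Hive connector settings, but the hostnames, credentials, and file path are assumptions, not the exact configuration in this repository:

```properties
# etc/catalog/hive.properties on the Presto coordinator and workers (illustrative)
connector.name=hive-hadoop2
# Catalog Manager: the Hive Metastore service
hive.metastore.uri=thrift://metastore:9083
# Permanent Memory: S3-compatible storage (LocalStack)
hive.s3.endpoint=http://localstack:4566
hive.s3.path-style-access=true
hive.s3.aws-access-key=test
hive.s3.aws-secret-key=test
```

Swapping any layer (for example, pointing `hive.s3.endpoint` at real S3) does not require changes to the others.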
To run this project, you need the following installed:
- Docker
- Docker Compose
- Clone this repository to your local machine.
- Navigate to the root directory of the project.
- Run `docker-compose up -d` to start all the services.
The Docker Compose configuration includes the following services:
- Hive Server
- Metastore
- PostgreSQL
- LocalStack
- PrestoDB Coordinator
- PrestoDB Workers
- Hue
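A minimal `docker-compose.yml` sketch of how these services could be wired together is shown below. The image names, service names, and environment variables are illustrative assumptions, not the exact configuration in this repository:

```yaml
version: "3"
services:
  postgres:
    image: postgres:13              # backing database for the Hive Metastore
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
  metastore:
    image: apache/hive:3.1.3        # runs the Hive Metastore service
    ports:
      - "9083:9083"
    depends_on:
      - postgres
  hive-server:
    image: apache/hive:3.1.3
    ports:
      - "10000:10000"               # HiveServer2 Thrift endpoint
    depends_on:
      - metastore
  localstack:
    image: localstack/localstack
    environment:
      SERVICES: s3                  # S3-compatible object storage
    ports:
      - "4566:4566"
  presto-coordinator:
    image: prestodb/presto
    ports:
      - "8081:8080"                 # coordinator mapped to host port 8081
  hue:
    image: gethue/hue
    ports:
      - "8888:8888"
```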
For detailed information on each service, please refer to the technical documentation.
Once all services are up and running, you can access the different components of the Data Platform consumption layer using their respective ports and tools:
- Hive: Connect to HiveServer2 on port 10000
- Metastore: Connect to the Hive metastore on port 9083
- PostgreSQL: Connect to the PostgreSQL database on port 5432
- LocalStack: Access S3 storage on port 4566
- PrestoDB: Access the PrestoDB coordinator on port 8081
- Hue: Access the Hue web interface on port 8888
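As a quick connectivity sketch using only the Python standard library, the snippet below builds (but does not send) a query request against the PrestoDB coordinator on port 8081. The `/v1/statement` endpoint and `X-Presto-User` header are standard PrestoDB REST conventions; the user name is a placeholder:

```python
import urllib.request

# PrestoDB coordinator from the service list above (host port 8081).
PRESTO_URL = "http://localhost:8081/v1/statement"

def build_query_request(sql: str, user: str = "demo") -> urllib.request.Request:
    """Build (but do not send) a PrestoDB statement request."""
    return urllib.request.Request(
        PRESTO_URL,
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},  # identifies the querying user to Presto
        method="POST",
    )

req = build_query_request("SHOW CATALOGS")
# urllib.request.urlopen(req)  # send once the stack is running
```

Sending the request returns a JSON document with a `nextUri` to poll for results, which is how Presto streams query output over HTTP.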
To get a better understanding of how to use the Data Platform consumption layer, a tutorial demonstrating its functionality can be found here.
This tutorial will walk you through the process of creating a sample dataset, storing it in the S3 object storage, and querying the data using PrestoDB and Hive Server.
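As a rough sketch of the storage layout involved, tables written to S3 as Parquet are typically organized in Hive-style partition directories, which both Hive and PrestoDB can prune at query time. The helper below is a hypothetical illustration of that key layout, not code from this repository:

```python
def partition_key(table: str, partitions: dict, filename: str) -> str:
    """Build a Hive-style partitioned object key,
    e.g. sales/dt=2024-01-01/part-0000.parquet."""
    # Each partition column becomes a col=value path segment.
    segments = "/".join(f"{col}={val}" for col, val in partitions.items())
    return f"{table}/{segments}/{filename}"

key = partition_key("sales", {"dt": "2024-01-01"}, "part-0000.parquet")
print(key)  # sales/dt=2024-01-01/part-0000.parquet
```

A query filtering on `dt` then only reads objects under the matching `dt=...` prefix instead of scanning the whole table.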
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.