This repository provides a playground for a modern data lake infrastructure:
- Data Storage: MinIO
- Data Query Engine: Dremio
- Data Lake Engine: Spark + Iceberg
- [Todo] Data Visualization: Metabase
To start up the infrastructure, run:

```shell
docker-compose up
```
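For reference, the compose file defines one service per component. A minimal sketch of the MinIO service only (image tag, command, and port mappings here are assumptions; the credentials match the ones used in the Dremio source configuration below — check the repository's docker-compose.yml for the actual definitions):

```yaml
# Sketch of the MinIO service; the real docker-compose.yml in this
# repository also defines the Spark, Jupyter, and Dremio services.
version: "3"
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API endpoint
      - "9001:9001"   # web console (MinIO UI)
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: password
```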
Once the containers are up and running, we can access the following UIs:
- Jupyter Notebook: http://localhost:8888
- Spark Driver UI: http://localhost:8080
- Spark History UI: http://localhost:18080
- MinIO UI: http://localhost:9001
- Dremio UI: http://localhost:9047
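Inside the Jupyter notebook, Spark can be pointed at MinIO through an Iceberg catalog. A minimal configuration sketch, assuming a Hadoop-type catalog named `lakehouse` and a pre-created `warehouse` bucket (both names are illustrative, not taken from this repository):

```python
# Spark settings for an Iceberg catalog whose warehouse lives in MinIO.
# The catalog name ("lakehouse") and bucket ("warehouse") are assumptions;
# the credentials match the MinIO defaults used elsewhere in this setup.
ICEBERG_MINIO_CONF = {
    # Register Iceberg's SQL extensions and a Hadoop-type catalog.
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lakehouse.type": "hadoop",
    "spark.sql.catalog.lakehouse.warehouse": "s3a://warehouse/",
    # Point the S3A filesystem at MinIO instead of AWS.
    "spark.hadoop.fs.s3a.endpoint": "http://minio:9000",
    "spark.hadoop.fs.s3a.access.key": "admin",
    "spark.hadoop.fs.s3a.secret.key": "password",
    "spark.hadoop.fs.s3a.path.style.access": "true",
}

def apply_conf(builder, conf=ICEBERG_MINIO_CONF):
    """Apply the settings to a SparkSession builder one by one."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With pyspark available in the notebook, `apply_conf(SparkSession.builder).getOrCreate()` would yield a session that can create and query Iceberg tables under the `lakehouse` catalog.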
To add MinIO as a data source in Dremio:

- Click "Add Source"
- Choose "Amazon S3"
- Enter the following configuration:
  - Inside General:
    - Name: MinIO
    - Authentication: AWS Access Key
    - AWS Access Key: admin
    - AWS Access Secret: password
    - IAM Role to Assume: (leave blank)
  - Inside Advanced Options:
    - Check the following:
      - Enable asynchronous access when possible
      - Enable compatibility mode
      - Enable file status check
      - Enable partition column inference
    - Under Connection Properties, set the following key-value pairs:
      - fs.s3a.path.style.access: true
      - fs.s3a.endpoint: http://minio:9000
      - fs.s3a.connection.ssl.enabled: false
- Click "Save".
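The connection properties above can also be kept in code, for example when scripting source creation against Dremio's REST API instead of clicking through the UI. A small sketch — the property names and values mirror the list above, while the name/value rendering format is an assumption about what such a payload would look like:

```python
# S3A connection properties for the MinIO source, mirroring the values
# entered in the Dremio UI above.
MINIO_CONNECTION_PROPERTIES = {
    "fs.s3a.path.style.access": "true",        # MinIO requires path-style URLs
    "fs.s3a.endpoint": "http://minio:9000",    # MinIO service on the compose network
    "fs.s3a.connection.ssl.enabled": "false",  # plain HTTP inside the network
}

def as_name_value_pairs(props):
    """Render the properties as a sorted list of {name, value} entries,
    the shape REST-style payloads commonly expect."""
    return [{"name": k, "value": v} for k, v in sorted(props.items())]
```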