rtdl makes it easy to build and maintain a real-time data lake. You send rtdl
a real-time data stream – often from a tool like Kafka or Segment – and it builds you a real-time
data lake in Parquet format that automatically works with Dremio to
give you access to your real-time data in popular BI and ML tools – just like a data warehouse.
rtdl can build your real-time data lake on AWS S3, GCP Cloud Storage, and Azure Blob Storage.
You provide the streams, rtdl builds your data lake.
Stay up-to-date on rtdl via our website and blog, and learn how to use rtdl via our documentation.
rtdl's initial feature set is built and working. You can use the API on port 80 to
configure streams that ingest JSON from an rtdl endpoint on port 8080, process it into Parquet,
and save the files to a destination configured in your stream. rtdl can write files locally or to
AWS S3, GCP Cloud Storage, and Azure Blob Storage, and you can query your data via Dremio's web UI
at http://localhost:9047 (login with username `rtdl` and password `rtdl1234`).
- Replaced Kafka & Zookeeper with Redpanda.
- Added support for HDFS.
- Fixed issue with handling booleans when writing Parquet.
- Added several logo variants and a banner to the public directory.
- Dremio Cloud support.
- Apache Hudi support.
- Start using GitHub Projects for work tracking.
- Research and implementation for Apache Iceberg, Delta Lake, and Project Nessie.
- Community contribution: Stateful Function for PII detection and masking.
- Graphical user interface.
For more detailed instructions, see our Initialize rtdl docs.
- Run `docker compose -f docker-compose.init.yml up -d`.
  - Note: This configuration should be fault-tolerant, but if any containers or processes fail when running this, run `docker compose -f docker-compose.init.yml down` and retry.
- After the containers `rtdl_rtdl-db-init`, `rtdl_dremio-init`, and `rtdl_redpanda-init` exit and complete with `EXITED (0)`, kill and delete the rtdl container set by running `docker compose -f docker-compose.init.yml down`.
- Run `docker compose up -d` every time after. Run `docker compose down` to stop. (The full sequence is sketched below.)

Note: Your memory setting in Docker must be at least 8GB. rtdl may become unstable if it is set lower.
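Taken together, the steps above amount to the following shell sequence (a sketch; the container names are the ones listed above, and the `docker ps` status check is one way to watch for `Exited (0)`):

```sh
# One-time initialization
docker compose -f docker-compose.init.yml up -d

# Watch the init containers until rtdl_rtdl-db-init, rtdl_dremio-init,
# and rtdl_redpanda-init all show "Exited (0)"
docker ps -a --filter "name=rtdl_" --format "{{.Names}}: {{.Status}}"

# Tear down the init container set
docker compose -f docker-compose.init.yml down

# Every subsequent run
docker compose up -d

# Stop rtdl
docker compose down
```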
Note #1: To start from scratch, run `rm -rf storage/` from the rtdl root folder.
Note #2: If you experience file write issues preventing the Dremio and/or Redpanda services from starting, please add `user: root` to the Dremio and Redpanda service definitions in the `docker-compose.init.yml` and `docker-compose.yml` files. This issue has been encountered on Linux.
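A sketch of what that change might look like in either compose file (the service name `dremio` is illustrative; use the names that actually appear in your compose files):

```yaml
services:
  dremio:
    # ...existing service definition...
    user: root   # workaround for file write permission issues on Linux
```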
For more detailed setup instructions for your cloud provider, see our setup docs:
- Create a new S3 bucket.
- For more information, see Amazon’s documentation.
- Create a new IAM user.
- For more information, see Amazon’s documentation.
- Create a new IAM policy.
- Use the below permissions, and attach the policy to the IAM user created in step 2. Replace `<YOUR_BUCKET_NAME>` with the name of the S3 bucket you created in step 1.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListAllBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    },
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME>"
      ]
    },
    {
      "Sid": "ManageBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
      ]
    }
  ]
}
```
- Attach the policy created in step 3 to the IAM user created in step 2.
- Create access keys for your IAM user.
- For more information, see Amazon's documentation.
- Save the `Access Key ID` and `Secret Access Key` for use in configuring your stream in rtdl. (A CLI sketch of these steps follows below.)
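If you prefer the command line, the console steps above roughly map to the following AWS CLI sketch (names like `rtdl-user`, `rtdl-s3-access`, and the region are illustrative, and `policy.json` is assumed to contain the policy document shown above):

```sh
# Step 1: create the bucket
aws s3api create-bucket --bucket <YOUR_BUCKET_NAME> --region us-west-1 \
    --create-bucket-configuration LocationConstraint=us-west-1

# Step 2: create the IAM user
aws iam create-user --user-name rtdl-user

# Step 3: create the policy from the JSON document above
aws iam create-policy --policy-name rtdl-s3-access \
    --policy-document file://policy.json

# Step 4: attach the policy (use the ARN returned by create-policy),
# then create access keys and save the AccessKeyId and SecretAccessKey
aws iam attach-user-policy --user-name rtdl-user \
    --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/rtdl-s3-access
aws iam create-access-key --user-name rtdl-user
```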
- Create a stream configuration record in rtdl by sending a call to the API at http://localhost:80/createStream.
  - Example `createStream` call body for creating a data lake on AWS S3:
```json
{
  "active": true,
  "message_type": "test-msg-aws",
  "file_store_type_id": 2,
  "region": "us-west-1",
  "bucket_name": "testBucketAWS",
  "folder_name": "testFolderAWS",
  "partition_time_id": 1,
  "compression_type_id": 1,
  "aws_access_key_id": "[aws_access_key_id]",
  "aws_secret_access_key": "[aws_secret_access_key]"
}
```
  - Example `createStream` curl call for creating a data lake on AWS S3:
```sh
curl --location --request POST 'http://localhost:80/createStream' \
--header 'Content-Type: application/json' \
--data-raw '{
    "active": true,
    "message_type": "test-msg-aws",
    "file_store_type_id": 2,
    "region": "us-west-1",
    "bucket_name": "testBucketAWS",
    "folder_name": "testFolderAWS",
    "partition_time_id": 1,
    "compression_type_id": 1,
    "aws_access_key_id": "[aws_access_key_id]",
    "aws_secret_access_key": "[aws_secret_access_key]"
}'
```
For more detailed instructions, see our Send data to rtdl docs.
All data should be sent to the `ingest` endpoint of the ingest service on port 8080 -- e.g. http://localhost:8080/ingest.
- You can send any JSON with just `stream_id` in the payload and rtdl will add it to your lake.
```json
{
  "stream_id": "837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
  "name": "user1",
  "array": [1,2,3],
  "properties": {"age": 20}
}
```
You can optionally add `message_type` should you choose to override the `message_type` specified while creating the stream. rtdl will default to the message type `rtdl_default` if a message type is absent in both the stream definition and the actual message. (A curl sketch follows below.)
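Mirroring the `createStream` curl style above, an ingest call might look like the following (a sketch; the `stream_id` value is the one from the payload example above, and `message_type` is shown only to illustrate the optional override):

```sh
curl --location --request POST 'http://localhost:8080/ingest' \
--header 'Content-Type: application/json' \
--data-raw '{
    "stream_id": "837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
    "message_type": "test-msg-aws",
    "name": "user1",
    "array": [1,2,3],
    "properties": {"age": 20}
}'
```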
rtdl has a multi-service architecture composed of a new generation of open source tools to process and access your data and custom-built services to interact with them more easily. To learn more about rtdl's services and architecture, visit our Architecture docs.
Contributions are always welcome!
See our CONTRIBUTING for ways to get started.
This project adheres to the rtdl code of conduct - a
direct adaptation of the Contributor Covenant,
version 2.1.