
rtdl - The Real-Time Data Lake ⚡️


rtdl makes it easy to build and maintain a real-time data lake. You send rtdl a real-time data stream – often from a tool like Kafka or Segment – and it builds you a real-time data lake in Parquet format that works with Dremio out of the box, giving you access to your real-time data in popular BI and ML tools – just like a data warehouse. rtdl can build your real-time data lake on AWS S3, GCP Cloud Storage, or Azure Blob Storage.

You provide the streams, rtdl builds your data lake.

Stay up-to-date on rtdl via our website and blog, and learn how to use rtdl via our documentation.

V0.1.1 - Current status -- what works and what doesn't

What works? 🚀

rtdl's initial feature set is built and working. You can use the API on port 80 to configure streams that ingest JSON from an rtdl endpoint on port 8080, process it into Parquet, and save the files to the destination configured in your stream. rtdl can write files locally or to AWS S3, GCP Cloud Storage, and Azure Blob Storage, and you can query your data via Dremio's web UI at http://localhost:9047 (log in with username rtdl and password rtdl1234).

What's new? 💥

  • Replaced Kafka & Zookeeper with Redpanda.
  • Added support for HDFS.
  • Fixed issue with handling booleans when writing Parquet.
  • Added several logo variants and a banner to the public directory.

What doesn't work/what's next on the roadmap? 🚴🏼

  • Dremio Cloud support.
  • Apache Hudi support.
  • Start using GitHub Projects for work tracking.
  • Research and implementation for Apache Iceberg, Delta Lake, and Project Nessie.
  • Community contribution: Stateful Function for PII detection and masking.
  • Graphical user interface.

Quickstart 🌱

Initialize and start rtdl

For more detailed instructions, see our Initialize rtdl docs.

  1. Run docker compose -f docker-compose.init.yml up -d.
    • Note: This configuration should be fault-tolerant, but if any containers or processes fail when running this, run docker compose -f docker-compose.init.yml down and retry.
  2. After the containers rtdl_rtdl-db-init, rtdl_dremio-init, and rtdl_redpanda-init complete with status EXITED (0), tear down the init container set by running docker compose -f docker-compose.init.yml down.
  3. Run docker compose up -d for every subsequent start.
    • Note: Your memory setting in Docker must be at least 8GB; rtdl may become unstable if it is set lower.
    • Run docker compose down to stop.

Note #1: To start from scratch, run rm -rf storage/ from the rtdl root folder.
Note #2: If file-write issues prevent the Dremio and/or Redpanda services from starting, add user: root to the Dremio and Redpanda service definitions in docker-compose.init.yml and docker-compose.yml. This issue has been encountered on Linux.
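
Put together, the full lifecycle looks like this (all commands run from the rtdl root folder; every command below comes from the steps above):

    # One-time initialization: start the init stack, wait for the *-init
    # containers to reach EXITED (0), then tear the init stack down.
    docker compose -f docker-compose.init.yml up -d
    docker compose -f docker-compose.init.yml down

    # Normal operation: start and stop rtdl.
    docker compose up -d
    docker compose down

    # Start from scratch: wipe local state.
    rm -rf storage/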

Set up your storage bucket (in AWS) and stream in rtdl

For more detailed setup instructions for your cloud provider, see our setup docs:

  1. Create a new S3 bucket.
  2. Create a new IAM user.
  3. Create a new IAM policy.
    • Use the permissions below, replacing <YOUR_BUCKET_NAME> with the name of the S3 bucket you created in step 1.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "ListAllBuckets",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetBucketLocation",
                      "s3:ListAllMyBuckets"
                  ],
                  "Resource": [
                      "arn:aws:s3:::*"
                  ]
              },
              {
                  "Sid": "ListBucket",
                  "Effect": "Allow",
                  "Action": [
                      "s3:ListBucket"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<YOUR_BUCKET_NAME>"
                  ]
              },
              {
                  "Sid": "ManageBucket",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:PutObject",
                      "s3:PutObjectAcl",
                      "s3:DeleteObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
                  ]
              }
          ]
      }
      
  4. Attach the policy created in step 3 to the IAM user created in step 2.
  5. Create access keys for your IAM user.
    • For more information, see Amazon's documentation.
    • Save the Access Key ID and Secret Access Key for use in configuring your stream in rtdl. (To script steps 1–5, see the AWS CLI sketch after this list.)
  6. Create a stream configuration record in rtdl.
    Send a call to the API at http://localhost:80/createStream.
    • Example createStream call body for creating a data lake on AWS S3.
      {
          "active": true,
          "message_type": "test-msg-aws",
          "file_store_type_id": 2,
          "region": "us-west-1",
          "bucket_name": "testBucketAWS",
          "folder_name": "testFolderAWS",
          "partition_time_id": 1,
          "compression_type_id": 1,
          "aws_access_key_id": "[aws_access_key_id]",
          "aws_secret_access_key": "[aws_secret_access_key]"
      }
      
    • Example createStream curl call for creating a data lake on AWS S3.
      curl --location --request POST 'http://localhost:80/createStream' \
      --header 'Content-Type: application/json' \
      --data-raw '{
          "active": true,
          "message_type": "test-msg-aws",
          "file_store_type_id": 2,
          "region": "us-west-1",
          "bucket_name": "testBucketAWS",
          "folder_name": "testFolderAWS",
          "partition_time_id": 1,
          "compression_type_id": 1,
          "aws_access_key_id": "[aws_access_key_id]",
          "aws_secret_access_key": "[aws_secret_access_key]"
      }'
      
    Note: A Postman collection with examples of all rtdl API calls can be found on GitHub at realtimedatalake/postman-rtdl-public.
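
To script steps 1–5 with the AWS CLI, here is a minimal sketch. The user name rtdl-user and policy name rtdl-s3-access are hypothetical names chosen for illustration, and policy.json is assumed to hold the policy document from step 3 saved to a file; substitute your own values.

    # Step 1: create the bucket (us-west-1 shown to match the example stream).
    aws s3api create-bucket --bucket <YOUR_BUCKET_NAME> --region us-west-1 \
        --create-bucket-configuration LocationConstraint=us-west-1

    # Step 2: create the IAM user.
    aws iam create-user --user-name rtdl-user

    # Steps 3-4: attach the policy document as an inline policy on the user.
    aws iam put-user-policy --user-name rtdl-user --policy-name rtdl-s3-access \
        --policy-document file://policy.json

    # Step 5: create access keys; record the AccessKeyId and SecretAccessKey
    # from the output for your createStream call.
    aws iam create-access-key --user-name rtdl-user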

Send data to rtdl

For more detailed instructions, see our Send data to rtdl docs.
All data should be sent to the /ingest endpoint of the ingest service on port 8080 -- e.g. http://localhost:8080/ingest.

  • You can send any JSON payload that includes a stream_id field, and rtdl will add it to your lake. For example:
    {
        "stream_id":"837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
        "name":"user1",
        "array":[1,2,3],
        "properties":{"age":20}
    }
    
    You can optionally include message_type in the payload to override the message_type specified when you created the stream. If message_type is absent from both the stream definition and the message, rtdl defaults to the message type rtdl_default.
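
    Putting it together, an ingest call with curl looks like this (the stream_id is the example value from above; use the id of the stream configuration you created):

      curl --location --request POST 'http://localhost:8080/ingest' \
      --header 'Content-Type: application/json' \
      --data-raw '{
          "stream_id":"837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
          "name":"user1",
          "array":[1,2,3],
          "properties":{"age":20}
      }'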

Architecture 🏛

rtdl has a multi-service architecture that combines a new generation of open-source tools for processing and accessing your data with custom-built services that make them easier to work with. To learn more about rtdl's services and architecture, visit our Architecture docs.

License 🤝

MIT

Contributing 😎

Contributions are always welcome!
See our CONTRIBUTING guide for ways to get started. This project adheres to the rtdl code of conduct - a direct adaptation of the Contributor Covenant, version 2.1.

Appreciation 🙏

Authors ✍🏽