Test data management tool for any data source, batch or real-time. Generate, validate and clean up data all in one tool.

Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean up.

[Figure: Basic data flow for Data Caterer]

Generate data for databases, files, messaging systems or HTTP requests through the UI, the Scala/Java SDK or YAML input; generation is executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated or consumed data in downstream data sources to keep your environments tidy. Define alerts to get notified when failures occur, and dig into issues using the generated report.

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

[Figure: Basic flow]

Quick start

  1. Docker
    docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.12.1
    Open localhost:9898.
  2. Run Scala/Java examples
    git clone git@github.com:data-catering/data-caterer-example.git
    cd data-caterer-example && ./run.sh
    # check results in docker/sample/report/index.html
  3. UI App: Mac download
  4. UI App: Windows download
    1. After downloading, go to the 'Downloads' folder and 'Extract All' from data-caterer-windows
    2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
    3. Click on 'More info', then click 'Run anyway' at the bottom
    4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
    5. If your browser doesn't open, go to http://localhost:9898 in your preferred browser
  5. UI App: Linux download

Integrations

Supported data sources

Data Caterer supports the below data sources. Check here for the full roadmap.

Data Source Type    Data Source
Cloud Storage       AWS S3
Cloud Storage       Azure Blob Storage
Cloud Storage       GCP Cloud Storage
Database            Cassandra
Database            MySQL
Database            Postgres
Database            Elasticsearch
Database            MongoDB
File                CSV
File                Delta Lake
File                JSON
File                Iceberg
File                ORC
File                Parquet
File                Hudi
HTTP                REST API
Messaging           Kafka
Messaging           Solace
Messaging           ActiveMQ
Messaging           Pulsar
Messaging           RabbitMQ
Metadata            Data Contract CLI
Metadata            Great Expectations
Metadata            Marquez
Metadata            OpenAPI/Swagger
Metadata            OpenMetadata
Metadata            Open Data Contract Standard (ODCS)
Metadata            Amundsen
Metadata            Datahub
Metadata            Solace Event Portal

Sponsorship

Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer as an enterprise, you are required to be a sponsor for the project.

Find out more details here to help with sponsorship.

Contributing

View details here about how you can contribute to the project.

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

[Figure: Types of run configurations]

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

Mildly Quick Start

Generate and validate data

// Generate data in Postgres: connection name and JDBC URL
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")

// Make account_id follow a pattern and be unique
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

// Validate that the Parquet output contains 1000 records
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count().isEqual(1000))

// Join the Parquet output to the generated Postgres data on account_id, then check the count
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinColumns("account_id")
       .withValidation(validation.count().isEqual(1000))
  )

// Wait for /data/parquet/customer before running the validation
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinColumns("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
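
To make the `regex("ACC[0-9]{10}").unique(true)` constraint and the count check concrete, here is a minimal sketch that uses only the Scala standard library. It illustrates what the schema and validation ask for; it is not how Data Caterer generates or validates data internally. `AccountIdSketch`, `generate` and `countIsEqual` are hypothetical names introduced for this example.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical helper, not part of the Data Caterer API.
object AccountIdSketch {
  // Pattern taken from the snippets above: "ACC" followed by exactly 10 digits.
  val AccountIdPattern = "ACC[0-9]{10}".r

  // Build one candidate: the literal prefix plus 10 random digits.
  private def candidate(rnd: Random): String =
    "ACC" + Seq.fill(10)(rnd.nextInt(10)).mkString

  // Keep drawing candidates until n distinct values exist (mirrors unique(true)).
  // Assumes n is far smaller than the 10^10 possible values.
  def generate(n: Int, seed: Long = 42L): Set[String] = {
    val rnd = new Random(seed)
    val ids = mutable.Set.empty[String]
    while (ids.size < n) ids += candidate(rnd)
    ids.toSet
  }

  // Simplified stand-in for a record-count validation.
  def countIsEqual(records: Iterable[String], expected: Int): Boolean =
    records.size == expected
}
```

Calling `AccountIdSketch.generate(1000)` yields 1000 distinct values that all match the pattern, so `AccountIdSketch.countIsEqual(ids, 1000)` holds for that result.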

Generate same data across data sources

// Generate data for a Kafka topic
kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)

// Use the same account_id values in Postgres and Kafka
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)

plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(kafkaTask -> List("account_id"))
)
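
The idea behind a foreign key relationship can be sketched in plain Scala: key values are produced once for the source and reused for the target, so both data sets line up on `account_id`. This is a conceptual sketch under that assumption, not Data Caterer's implementation; `ForeignKeySketch`, the case classes and the `balance` and `eventType` fields are hypothetical.

```scala
import scala.util.Random

// Hypothetical model of two data sources sharing a key; not the Data Caterer API.
object ForeignKeySketch {
  final case class AccountRow(accountId: String, balance: Double)
  final case class AccountEvent(accountId: String, eventType: String)

  // Source side: generate the key values once.
  def accountRows(n: Int, rnd: Random): Seq[AccountRow] =
    (1 to n).map { i =>
      // A sequential, zero-padded suffix keeps every ID unique and 10 digits long.
      AccountRow(f"ACC$i%010d", rnd.nextInt(100000) / 100.0)
    }

  // Target side: reuse the source keys instead of generating new ones.
  def accountEvents(rows: Seq[AccountRow]): Seq[AccountEvent] =
    rows.map(r => AccountEvent(r.accountId, "ACCOUNT_CREATED"))
}
```

Because the events are derived from the rows, every `accountId` in the target also exists in the source, which is the property the relationship above is meant to guarantee.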

Generate data and clean up

// Generate 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumn(5, "account_id"))

// Generate between 1 and 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(1).max(5), "account_id"))

// Delete the generated records instead of generating new data
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

// Also delete records in Cassandra that share the generated account_id
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),
   List(cassandraTxns -> List("account_id"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

// As above, but the Cassandra column is derived from account_id with a SQL expression
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),
   List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
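
As a rough mental model of the two ideas in this section, the plain-Scala sketch below generates a variable number of records per `account_id` and then removes only the rows whose key was generated. It assumes clean-up is keyed on the generated values, which is an assumption about the behaviour rather than a description of Data Caterer's internals; `RecordsPerColumnSketch`, `Transaction` and the `amount` field are hypothetical.

```scala
import scala.util.Random

// Hypothetical model of per-key record counts and key-based clean-up.
object RecordsPerColumnSketch {
  final case class Transaction(accountId: String, amount: Int)

  // For every account, emit between min and max transactions (both inclusive).
  def transactions(accountIds: Seq[String], min: Int, max: Int, rnd: Random): Seq[Transaction] =
    accountIds.flatMap { id =>
      val n = min + rnd.nextInt(max - min + 1)
      Seq.fill(n)(Transaction(id, rnd.nextInt(1000)))
    }

  // Clean-up: drop only rows whose key was generated, leaving other rows untouched.
  def deleteGenerated(table: Seq[Transaction], generatedIds: Set[String]): Seq[Transaction] =
    table.filterNot(t => generatedIds.contains(t.accountId))
}
```

With `min = 0`, some accounts end up with no transactions at all, which is the difference between the `generator.min(1)` and `generator.min(0)` variants above.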

Generate data with schema from metadata source

// Schema from an Open Data Contract Standard (ODCS) file
parquet("customer_parquet", "/data/parquet/customer")
  .schema(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))

// Schema from an OpenAPI/Swagger file
http("my_http")
  .schema(metadataSource.openApi("/data/http/petstore.json"))

Validate data using validations from metadata source

parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))