Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean up.

Generate data for databases, files, messaging systems or HTTP requests via the UI, the Scala/Java SDK or YAML input, all executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or the data consumed into downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and dig into issues using the generated report.

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

Quick start

Docker

docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.12.1

Open localhost:9898.

Run Scala/Java examples

git clone git@github.com:data-catering/data-caterer-example.git
cd data-caterer-example && ./run.sh
# check results in docker/sample/report/index.html

UI App: Mac download
UI App: Windows download
1. After downloading, go to 'Downloads' folder and 'Extract All' from data-caterer-windows
2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
3. Click on 'More info' then at the bottom, click 'Run anyway'
4. Go to '/Program Files/DataCaterer' folder and run DataCaterer application
5. If your browser doesn't open, go to http://localhost:9898 in your preferred browser
UI App: Linux download

Integrations

Supported data sources

Data Caterer supports the below data sources. Check here for the full roadmap.

Data Source Type	Data Source	Support
Cloud Storage	AWS S3	✅
Cloud Storage	Azure Blob Storage	✅
Cloud Storage	GCP Cloud Storage	✅
Database	Cassandra	✅
Database	MySQL	✅
Database	Postgres	✅
Database	Elasticsearch	❌
Database	MongoDB	❌
File	CSV	✅
File	Delta Lake	✅
File	JSON	✅
File	Iceberg	✅
File	ORC	✅
File	Parquet	✅
File	Hudi	❌
HTTP	REST API	✅
Messaging	Kafka	✅
Messaging	Solace	✅
Messaging	ActiveMQ	❌
Messaging	Pulsar	❌
Messaging	RabbitMQ	❌
Metadata	Data Contract CLI	✅
Metadata	Great Expectations	✅
Metadata	Marquez	✅
Metadata	OpenAPI/Swagger	✅
Metadata	OpenMetadata	✅
Metadata	Open Data Contract Standard (ODCS)	✅
Metadata	Amundsen	❌
Metadata	Datahub	❌
Metadata	Solace Event Portal	❌

Sponsorship

Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer as an enterprise, you are required to be a sponsor for the project.

Find out more details here to help with sponsorship.

Contributing

View details here about how you can contribute to the project.

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

Mildly Quick Start

Generate and validate data

I want to generate data in Postgres

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")  //name and url

But I want `account_id` to follow a pattern and be unique

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
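
To get a feel for which values the `ACC[0-9]{10}` pattern allows, here is a minimal plain-Scala sketch that does not depend on Data Caterer. The `AccountIdPattern` object and its `isValid` helper are hypothetical names used only for illustration.

```scala
// Illustration only: checks strings against the regex used for account_id above.
// AccountIdPattern and isValid are hypothetical helpers, not part of Data Caterer.
object AccountIdPattern {
  private val pattern = "ACC[0-9]{10}".r.pattern

  // True only when the whole string is "ACC" followed by exactly 10 digits.
  def isValid(id: String): Boolean = pattern.matcher(id).matches()
}
```

For example, `AccountIdPattern.isValid("ACC0123456789")` is true, while `AccountIdPattern.isValid("ACC123")` is false because fewer than 10 digits follow the prefix.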

I then want to test my job ingests all the data after generating

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count().isEqual(1000))

I want to make sure all the `account_id` values in Postgres are in the Parquet file

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinColumns("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
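
The idea behind the upstream validation can be sketched in plain Scala: join the downstream records to the generated records on `account_id` and count the matches. This is only a conceptual illustration, not how Data Caterer implements it; `UpstreamCheck` and `matchedCount` are hypothetical names.

```scala
// Illustration only: the idea of joining upstream and downstream data on account_id.
// UpstreamCheck and matchedCount are hypothetical names, not Data Caterer APIs.
object UpstreamCheck {
  // Counts downstream records whose account_id also exists upstream,
  // which is what an inner join on a unique upstream key would return.
  def matchedCount(upstreamIds: Seq[String], downstreamIds: Seq[String]): Int = {
    val upstream = upstreamIds.toSet
    downstreamIds.count(upstream.contains)
  }
}
```

If 1000 unique `account_id` values were generated and the downstream data holds no duplicates, a matched count of 1000 means every generated value arrived in the Parquet output.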

I want to start validating once the Parquet file is available

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
     validation.upstreamData(postgresTask)
       .joinColumns("account_id")
       .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
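
Conceptually, a file wait condition amounts to polling until the path exists. The plain-Scala sketch below shows that idea under assumed retry settings; `FileWait`, `waitForPath`, `attempts` and `sleepMillis` are hypothetical and say nothing about how Data Caterer schedules its own checks.

```scala
import java.nio.file.{Files, Paths}

// Illustration only: poll for a path before starting validations.
// FileWait and waitForPath are hypothetical names, not Data Caterer APIs.
object FileWait {
  // Checks the path up to `attempts` times, sleeping between checks,
  // and returns whether the path exists at the end.
  def waitForPath(path: String, attempts: Int, sleepMillis: Long): Boolean = {
    var remaining = attempts
    while (remaining > 0 && !Files.exists(Paths.get(path))) {
      Thread.sleep(sleepMillis)
      remaining -= 1
    }
    Files.exists(Paths.get(path))
  }
}
```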

Generate same data across data sources

I also want to generate events in Kafka

kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)

But I want the same `account_id` to show in Postgres and Kafka

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)

plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(kafkaTask -> List("account_id"))
)

Generate data and clean up

I want to generate 5 transactions per `account_id`

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumn(5, "account_id"))

Randomly generate 1 to 5 transactions per `account_id`

postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(1).max(5), "account_id"))
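
As a rough picture of the resulting data shape, the plain-Scala sketch below emits a random number of rows per `account_id`. It assumes both bounds are inclusive, which the text above suggests but does not confirm; `RecordsPerAccount` and `generate` are hypothetical names, not Data Caterer APIs.

```scala
import scala.util.Random

// Illustration only: a random number of transaction rows per account_id.
// RecordsPerAccount and generate are hypothetical names, not Data Caterer APIs.
object RecordsPerAccount {
  // For each account_id, emits between min and max rows (both inclusive, an assumption),
  // returned as (account_id, transaction number) pairs. Requires max >= min.
  def generate(accountIds: Seq[String], min: Int, max: Int, rng: Random): Seq[(String, Int)] =
    accountIds.flatMap { id =>
      val n = min + rng.nextInt(max - min + 1)
      (1 to n).map(txn => (id, txn))
    }
}
```

Calling `RecordsPerAccount.generate(ids, 1, 5, new Random())` yields at least one and at most five rows for every ID in `ids`.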

I want to delete the generated data

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

I also want to delete the data in Cassandra, because my job consumed the data from Postgres and pushed it to Cassandra

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),
   List(cassandraTxns -> List("account_id"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

But Cassandra only stores `account_number`, which is derived from `account_id`

val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
   postgresTask, List("account_id"),
   List(),
   List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
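
Because Data Caterer executes via Spark, the expression is presumably evaluated as Spark SQL, where `SUBSTR` positions start at 1. Under that assumption, the plain-Scala sketch below mimics what `SUBSTR(account_id, 3)` returns for a generated value; `SubstrSketch` and `sqlSubstr` are hypothetical helpers for illustration.

```scala
// Illustration only: mimics SQL SUBSTR(value, pos) with 1-based positions (pos >= 1).
// SubstrSketch and sqlSubstr are hypothetical names, not Data Caterer APIs.
object SubstrSketch {
  // pos = 3 keeps everything from the third character onward,
  // i.e. it drops the first two characters.
  def sqlSubstr(value: String, pos: Int): String =
    value.substring(math.min(pos - 1, value.length))
}
```

With this reading, `SubstrSketch.sqlSubstr("ACC0123456789", 3)` gives `C0123456789`, so only the leading `AC` is removed. Dropping the whole `ACC` prefix would need position 4.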

Generate data with schema from metadata source

I have a data contract using the Open Data Contract Standard (ODCS) format

parquet("customer_parquet", "/data/parquet/customer")
  .schema(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))

I have an OpenAPI/Swagger doc

http("my_http")
  .schema(metadataSource.openApi("/data/http/petstore.json"))

Validate data using validations from metadata source

I have expectations from Great Expectations

parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))