A test data management tool with automated data generation, validation and cleanup.
Generate data for databases, files, messaging systems or HTTP requests via the UI, the Scala/Java SDK or YAML input; generation is executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated or consumed data in downstream data sources to keep your environments tidy. Define alerts to be notified when failures occur, and dig into issues using the generated report.
Scala/Java examples can be found here.
- Batch and/or event data generation
- Maintain relationships across any dataset
- Create custom data generation/validation scenarios
- Data validation
- Clean up generated and downstream data
- Suggest data validations
- Metadata discovery
- Detailed report of generated data and validation results
- Alerts to be notified of results
- Run as GitHub Action
- Docker
  docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.12.1
  Then open http://localhost:9898.
- Run Scala/Java examples
  git clone git@github.com:data-catering/data-caterer-example.git
  cd data-caterer-example && ./run.sh
  # check results in docker/sample/report/index.html
- UI App: Mac download
- UI App: Windows download
  - After downloading, go to the 'Downloads' folder and choose 'Extract All' on data-caterer-windows
  - Double-click 'DataCaterer-1.0.0' to install Data Caterer
  - Click 'More info', then click 'Run anyway' at the bottom
  - Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
  - If your browser doesn't open, go to http://localhost:9898 in your preferred browser
- UI App: Linux download
Data Caterer supports the below data sources. Check here for the full roadmap.
Data Source Type | Data Source | Support |
---|---|---|
Cloud Storage | AWS S3 | ✅ |
Cloud Storage | Azure Blob Storage | ✅ |
Cloud Storage | GCP Cloud Storage | ✅ |
Database | Cassandra | ✅ |
Database | MySQL | ✅ |
Database | Postgres | ✅ |
Database | Elasticsearch | ❌ |
Database | MongoDB | ❌ |
File | CSV | ✅ |
File | Delta Lake | ✅ |
File | JSON | ✅ |
File | Iceberg | ✅ |
File | ORC | ✅ |
File | Parquet | ✅ |
File | Hudi | ❌ |
HTTP | REST API | ✅ |
Messaging | Kafka | ✅ |
Messaging | Solace | ✅ |
Messaging | ActiveMQ | ❌ |
Messaging | Pulsar | ❌ |
Messaging | RabbitMQ | ❌ |
Metadata | Data Contract CLI | ✅ |
Metadata | Great Expectations | ✅ |
Metadata | Marquez | ✅ |
Metadata | OpenAPI/Swagger | ✅ |
Metadata | OpenMetadata | ✅ |
Metadata | Open Data Contract Standard (ODCS) | ✅ |
Metadata | Amundsen | ❌ |
Metadata | Datahub | ❌ |
Metadata | Solace Event Portal | ❌ |
Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer as an enterprise, you are required to be a sponsor for the project.
Find out more about sponsorship here.
View details here about how you can contribute to the project.
Different ways to run Data Caterer based on your use case:
Design motivations and details can be found here.
// Generate data in Postgres
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer") // connection name and JDBC URL
// Make account_id follow a pattern ("ACC" plus 10 digits) and keep it unique
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
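To make the effect of `.regex("ACC[0-9]{10}").unique(true)` concrete, here is a minimal plain-Scala sketch of values that satisfy that pattern without repeats. It does not use Data Caterer and makes no claim about how Data Caterer generates values internally; `AccountIdSketch` and its methods are hypothetical names introduced only for illustration.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical helper, not part of the Data Caterer API.
object AccountIdSketch {
  // Same pattern as in the snippet above: literal "ACC" followed by exactly 10 digits.
  private val AccountIdPattern = "ACC[0-9]{10}".r

  // One candidate value: the prefix plus 10 random digits.
  def randomAccountId(rng: Random): String =
    "ACC" + Seq.fill(10)(rng.nextInt(10)).mkString

  // Keep drawing candidates until `n` distinct values have been collected.
  def uniqueAccountIds(n: Int, seed: Long): Seq[String] = {
    val rng = new Random(seed)
    val ids = mutable.LinkedHashSet.empty[String]
    while (ids.size < n) ids += randomAccountId(rng)
    ids.toSeq
  }

  // True only when the whole string matches the pattern.
  def matchesPattern(id: String): Boolean =
    AccountIdPattern.pattern.matcher(id).matches()

  def main(args: Array[String]): Unit =
    uniqueAccountIds(5, seed = 42L).foreach(println)
}
```

Every printed value has the shape `ACC` plus ten digits, and no value appears twice.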
// Generate into Postgres, then validate that the Parquet output holds 1000 records
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count().isEqual(1000))
// Validate the Parquet output against the generated Postgres data, joined on account_id
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinColumns("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
// Same validation, but wait for the Parquet path to exist before running it
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinColumns("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
// Generate events into a Kafka topic
kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)
// Use the same account_id values in Postgres and in the Kafka events
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .schema(field.name("account_id").regex("ACC[0-9]{10}"))
val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .schema(...)
plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(kafkaTask -> List("account_id"))
)
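The relationship declared with `addForeignKeyRelationship` means the target's `account_id` values come from the source rather than being generated independently. The following plain-Scala sketch illustrates that idea only. It does not call Data Caterer, and the case classes, the `payload` field and the `ForeignKeySketch` object are hypothetical.

```scala
// Hypothetical model of a source table and a target topic; not the Data Caterer API.
object ForeignKeySketch {
  final case class AccountRow(accountId: String)
  final case class AccountEvent(accountId: String, payload: String)

  // Each event reuses an account_id that already exists in the source rows
  // instead of generating an unrelated value.
  def eventsFrom(rows: Seq[AccountRow]): Seq[AccountEvent] =
    rows.map(row => AccountEvent(row.accountId, s"event for ${row.accountId}"))

  // True when every key in the target also appears in the source.
  def keysAligned(rows: Seq[AccountRow], events: Seq[AccountEvent]): Boolean =
    events.map(_.accountId).toSet.subsetOf(rows.map(_.accountId).toSet)

  def main(args: Array[String]): Unit = {
    val rows = Seq(AccountRow("ACC0000000001"), AccountRow("ACC0000000002"))
    val events = eventsFrom(rows)
    events.foreach(println)
    println(keysAligned(rows, events))
  }
}
```

An event carrying an `account_id` that is absent from the source rows would make `keysAligned` return `false`, which is the situation the relationship is meant to prevent.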
// Generate 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumn(5, "account_id"))
// Generate a random number of transactions, between 1 and 5, per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(1).max(5), "account_id"))
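As a rough picture of what the two per-column count options produce, the plain-Scala sketch below expands a list of account IDs into transactions, once with a fixed count and once with a random count per ID. It is an illustration only: it does not use Data Caterer, the names are hypothetical, and it assumes the `min` and `max` bounds are both inclusive.

```scala
import scala.util.Random

// Hypothetical illustration; not the Data Caterer API.
object RecordsPerColumnSketch {
  final case class Transaction(accountId: String, txnNumber: Int)

  // Fixed count: every account_id gets exactly `perAccount` transactions.
  def fixed(accountIds: Seq[String], perAccount: Int): Seq[Transaction] =
    accountIds.flatMap(id => (1 to perAccount).map(n => Transaction(id, n)))

  // Random count: every account_id gets between `min` and `max` transactions
  // (both bounds inclusive, which is an assumption of this sketch).
  def random(accountIds: Seq[String], min: Int, max: Int, rng: Random): Seq[Transaction] =
    accountIds.flatMap { id =>
      val howMany = min + rng.nextInt(max - min + 1)
      (1 to howMany).map(n => Transaction(id, n))
    }

  def main(args: Array[String]): Unit = {
    val ids = Seq("ACC0000000001", "ACC0000000002")
    println(fixed(ids, 5).size) // 2 account IDs x 5 transactions each
    random(ids, 1, 5, new Random(42L)).foreach(println)
  }
}
```

With a lower bound of 0 instead of 1, some account IDs may end up with no transactions at all.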
// Delete the previously generated records instead of generating new data
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))
val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
// Also delete the data in Cassandra, because the job consumed the data in Postgres and pushed it to Cassandra
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))
val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")
val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("account_id"))
)
val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
// Delete downstream Cassandra data when the job transforms account_id: the SQL expression maps it to account_number
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))
val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")
val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)
val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
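The expression `SUBSTR(account_id, 3) AS account_number` uses SQL substring semantics, where positions are 1-based. The plain-Scala sketch below mirrors that behaviour for positive positions so the resulting `account_number` is easy to predict. `SubstrSketch` and `sqlSubstr` are hypothetical names, and the sample value assumes the `ACC[0-9]{10}` pattern from the earlier snippets.

```scala
// Hypothetical helper mirroring Spark SQL's SUBSTR(str, pos) for pos >= 1.
object SubstrSketch {
  // Positions are 1-based, so pos = 3 starts at the third character.
  def sqlSubstr(str: String, pos: Int): String = {
    require(pos >= 1, "this sketch only covers positive, 1-based positions")
    if (pos > str.length) "" else str.substring(pos - 1)
  }

  def main(args: Array[String]): Unit = {
    println(sqlSubstr("ACC0123456789", 3)) // C0123456789
    println(sqlSubstr("ACC0123456789", 4)) // 0123456789
  }
}
```

Under that assumed pattern, position 3 keeps the final `C` of the prefix, while position 4 would keep only the digits. Which one is right depends on what the consuming job actually writes to Cassandra.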
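// Take the schema from an Open Data Contract Standard (ODCS) file instead of defining it by hand
parquet("customer_parquet", "/data/parquet/customer")
  .schema(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))
// Take the HTTP request schema from an OpenAPI/Swagger file
http("my_http")
  .schema(metadataSource.openApi("/data/http/petstore.json"))
// Take the validations from a Great Expectations file
parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))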