Data contracts bring data providers and data consumers together.
A data contract is a document that defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. Think of an API, but for data. A data contract is implemented by a data product or other data technologies, even legacy data warehouses. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees.
The data contract specification defines a YAML format to describe attributes of provided data sets. It is data platform neutral and can be used with any data platform, such as AWS S3, Google BigQuery, Azure, Databricks, and Snowflake. The data contract specification is an open initiative to define a common data contract format. It follows OpenAPI and AsyncAPI conventions.
Data contracts come into play when data is exchanged between different teams or organizational units, such as in a data mesh architecture. First and foremost, data contracts are a communication tool to express a common understanding of how data should be structured and interpreted. They make semantic and quality expectations explicit. They are often created collaboratively in workshops with data providers and data consumers. Later, in development and production, they also serve as the basis for code generation, testing, schema validation, quality checks, monitoring, access control, and computational governance policies.
The specification comes along with the Data Contract CLI, an open-source tool to develop, validate, and enforce data contracts.
Note: The term "data contract" refers to a specification that is usually owned by the data provider and thus does not align with a "contract" in a legal sense as a mutual agreement between two parties. The term "contract" may be somewhat misleading, but it is how it is used by the industry. The mutual agreement between one data provider and one data consumer is the "data usage agreement" that refers to a data contract. Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes.
0.9.3 (Changelog)
dataContractSpecification: 0.9.3
id: urn:datacontract:checkout:orders-latest
info:
  title: Orders Latest
  version: 1.0.0
  description: |
    Successful customer orders in the webshop.
    All orders since 2020-01-01.
    Orders with their line items are in their current state (no history included).
  owner: Checkout Team
  slackChannel: "#checkout"
  contact:
    name: John Doe (Data Product Owner)
    url: https://teams.microsoft.com/l/channel/example/checkout
tags:
  - checkout
  - orders
  - s3
links:
  datacontractCli: https://cli.datacontract.com
servers:
  production:
    type: s3
    environment: prod
    location: s3://datacontract-example-orders-latest/data/{model}/*.json
    format: json
    delimiter: new_line
    description: "One folder per model. One file per day."
terms:
  usage: |
    Data can be used for reports, analytics and machine learning use cases.
    Orders may be linked and joined with other tables.
  limitations: |
    Not suitable for real-time use cases.
    Data may not be used to identify individual customers.
    Max data processing per day: 10 TiB
  billing: 5000 USD per month
  noticePeriod: P3M
models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primary: true
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
        example: "2024-09-09T08:30:00Z"
      order_total:
        description: Total amount of the order in the smallest monetary unit (e.g., cents).
        type: long
        required: true
        example: "9999"
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer. The email address was not verified.
        type: text
        format: email
        required: true
        pii: true
        classification: sensitive
      processed_timestamp:
        description: The timestamp when the record was processed by the data platform.
        type: timestamp
        required: true
        config:
          jsonType: string
          jsonFormat: date-time
  line_items:
    description: A single article that is part of an order.
    type: table
    fields:
      lines_item_id:
        type: text
        description: Primary key of the line_items table
        required: true
        unique: true
        primary: true
      order_id:
        $ref: '#/definitions/order_id'
        references: orders.order_id
      sku:
        description: The purchased article number
        $ref: '#/definitions/sku'
definitions:
  order_id:
    domain: checkout
    name: order_id
    title: Order ID
    type: text
    format: uuid
    description: An internal ID that identifies an order in the online shop.
    example: 243c25e5-a081-43a9-aeab-6d5d5b6cb5e2
    pii: true
    classification: restricted
    tags:
      - orders
  sku:
    domain: inventory
    name: sku
    title: Stock Keeping Unit
    type: text
    pattern: ^[A-Za-z0-9]{8,14}$
    example: "96385074"
    description: |
      A Stock Keeping Unit (SKU) is an internal unique identifier for an article.
      It is typically associated with an article's barcode, such as the EAN/GTIN.
    links:
      wikipedia: https://en.wikipedia.org/wiki/Stock_keeping_unit
    tags:
      - inventory
examples:
  - type: csv # csv, json, yaml, custom
    model: orders
    description: An example list of order records.
    data: | # expressed as string or inline yaml or via "$ref: data.csv"
      order_id,order_timestamp,order_total,customer_id,customer_email_address,processed_timestamp
      "1001","2030-09-09T08:30:00Z",2500,"1000000001","mary.taylor82@example.com","2030-09-09T08:31:00Z"
      "1002","2030-09-08T15:45:00Z",1800,"1000000002","michael.miller83@example.com","2030-09-09T08:31:00Z"
      "1003","2030-09-07T12:15:00Z",3200,"1000000003","michael.smith5@example.com","2030-09-09T08:31:00Z"
      "1004","2030-09-06T19:20:00Z",1500,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
      "1005","2030-09-05T10:10:00Z",4200,"1000000004","elizabeth.moore80@example.com","2030-09-09T08:31:00Z"
      "1006","2030-09-04T14:55:00Z",2800,"1000000005","john.davis28@example.com","2030-09-09T08:31:00Z"
      "1007","2030-09-03T21:05:00Z",1900,"1000000006","linda.brown67@example.com","2030-09-09T08:31:00Z"
      "1008","2030-09-02T17:40:00Z",3600,"1000000007","patricia.smith40@example.com","2030-09-09T08:31:00Z"
      "1009","2030-09-01T09:25:00Z",3100,"1000000008","linda.wilson43@example.com","2030-09-09T08:31:00Z"
      "1010","2030-08-31T22:50:00Z",2700,"1000000009","mary.smith98@example.com","2030-09-09T08:31:00Z"
  - type: csv
    model: line_items
    description: An example list of line items.
    data: |
      lines_item_id,order_id,sku
      "LI-1","1001","5901234123457"
      "LI-2","1001","4001234567890"
      "LI-3","1002","5901234123457"
      "LI-4","1002","2001234567893"
      "LI-5","1003","4001234567890"
      "LI-6","1003","5001234567892"
      "LI-7","1004","5901234123457"
      "LI-8","1005","2001234567893"
      "LI-9","1005","5001234567892"
      "LI-10","1005","6001234567891"
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
  retention:
    description: Data is retained for one year
    period: P1Y
    unlimited: false
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: 25h
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
  freshness:
    description: The age of the youngest row in a table.
    threshold: 25h
    timestampField: orders.order_timestamp
  frequency:
    description: Data is delivered once a day
    type: batch # or streaming
    interval: daily # for batch, either or cron
    cron: 0 0 * * * # for batch, either or interval
  support:
    description: The data is available during typical business hours at headquarters
    time: 9am to 5pm in EST on business days
    responseTime: 1h
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC.
    interval: weekly
    cron: 0 0 * * 0
    recoveryTime: 24 hours
    recoveryPoint: 1 week
quality:
  type: SodaCL # data quality check format: SodaCL, montecarlo, custom
  specification: # expressed as string or inline yaml or via "$ref: checks.yaml"
    checks for orders:
      - row_count >= 5
      - duplicate_count(order_id) = 0
    checks for line_items:
      - values in (order_id) must exist in orders (order_id)
      - row_count >= 5
The Data Contract CLI is a command line tool and Python library to lint, test, import and export data contracts.
Here is a short example of how to verify that your actual dataset matches the data contract:
pip3 install datacontract-cli
datacontract test https://datacontract.com/examples/orders-latest/datacontract.yaml
or, if you prefer Docker:
docker run datacontract/cli test https://datacontract.com/examples/orders-latest/datacontract.yaml
The Data Contract contains all required information to verify data:
- The servers block has the connection details to the actual data set.
- The models define the syntax, formats, and constraints.
- The quality block defines further quality checks.
The Data Contract CLI chooses the appropriate engine, formulates test cases, connects to the server, and executes the tests, based on the server type.
More information and configuration options are available at cli.datacontract.com.
- Data Contract Object
- Info Object
- Contact Object
- Server Object
- Terms Object
- Model Object
- Field Object
- Definition Object
- Schema Object
- Example Object
- Service Level Object
- Quality Object
- Data Types
- Specification Extensions
JSON Schema of the Data Contract Specification.
This is the root document.
It is RECOMMENDED that the root document be named: datacontract.yaml.
Field | Type | Description |
---|---|---|
dataContractSpecification | string | REQUIRED. Specifies the Data Contract Specification being used. |
id | string | REQUIRED. An organization-wide unique technical identifier, such as a UUID, URN, slug, string, or number. |
info | Info Object | REQUIRED. Specifies the metadata of the data contract. May be used by tooling. |
servers | Map[string, Server Object] | Specifies the servers of the data contract. |
terms | Terms Object | Specifies the terms and conditions of the data contract. |
models | Map[string, Model Object] | Specifies the logical data model. |
definitions | Map[string, Definition Object] | Specifies definitions. |
schema | Schema Object | Specifies the physical schema. The specification supports different schema formats. |
examples | Array of Example Objects | Specifies example data sets for the data model. The specification supports different example types. |
servicelevels | Service Levels Object | Specifies the service level of the provided data. |
quality | Quality Object | Specifies the quality attributes and checks. The specification supports different quality check DSLs. |
links | Map[string, string] | Additional external documentation links. |
tags | Array of string | Custom metadata to provide additional context. |
This object MAY be extended with Specification Extensions.
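As a rough illustration of the REQUIRED markers, the following Python sketch checks an already parsed contract (a plain dictionary) for the three required root fields, plus the `title` and `version` fields that the Info Object requires. The function name `missing_required_fields` is hypothetical and not part of the Data Contract CLI; this is a minimal sketch, not a replacement for validating against the JSON Schema.

```python
# Required root-level fields of the Data Contract Object.
REQUIRED_ROOT_FIELDS = ("dataContractSpecification", "id", "info")
# Required fields of the Info Object.
REQUIRED_INFO_FIELDS = ("title", "version")


def missing_required_fields(contract: dict) -> list[str]:
    """Return the dotted names of required fields that are absent (hypothetical helper)."""
    missing = [name for name in REQUIRED_ROOT_FIELDS if name not in contract]
    info = contract.get("info")
    if isinstance(info, dict):
        missing += [f"info.{name}" for name in REQUIRED_INFO_FIELDS if name not in info]
    return missing


contract = {
    "dataContractSpecification": "0.9.3",
    "id": "urn:datacontract:checkout:orders-latest",
    "info": {"title": "Orders Latest", "version": "1.0.0"},
}
print(missing_required_fields(contract))
print(missing_required_fields({"info": {"title": "Orders Latest"}}))
```

The first call reports nothing missing; the second lists the two absent root fields and the missing `info.version`.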
Metadata and life cycle information about the data contract.
Field | Type | Description |
---|---|---|
title | string | REQUIRED. The title of the data contract. |
version | string | REQUIRED. The version of the data contract document (which is distinct from the Data Contract Specification version or the Data Product implementation version). |
status | string | The status of the data contract. Can be proposed, in development, active, deprecated, retired. |
description | string | A description of the data contract. |
owner | string | The owner or team responsible for managing the data contract and providing the data. |
contact | Contact Object | Contact information for the data contract. |
This object MAY be extended with Specification Extensions.
Contact information for the data contract.
Field | Type | Description |
---|---|---|
name | string | The identifying name of the contact person/organization. |
url | string | The URL pointing to the contact information. This MUST be in the form of a URL. |
email | string | The email address of the contact person/organization. This MUST be in the form of an email address. |
This object MAY be extended with Specification Extensions.
The fields are dependent on the defined type.
Field | Type | Description |
---|---|---|
type | string | REQUIRED. The type of the data product technology that implements the data contract. Well-known server types are: bigquery, s3, glue, redshift, azure, sqlserver, snowflake, databricks, postgres, oracle, kafka, pubsub, sftp, kinesis, trino, local |
description | string | An optional string describing the server. |
environment | string | An optional string describing the environment, e.g., prod, sit, stg. |
This object MAY be extended with Specification Extensions.
Field | Type | Description |
---|---|---|
type | string | bigquery |
project | string | The GCP project name. |
dataset | string | |
Field | Type | Description |
---|---|---|
type | string | s3 |
location | string | S3 URL, starting with s3:// |
endpointUrl | string | The server endpoint for S3-compatible servers, such as https://minio.example.com |
format | string | Format of files, such as parquet, delta, json, csv |
delimiter | string | (Only for format = json), how multiple json documents are delimited within one file, e.g., new_line, array |
Example:
servers:
  production:
    type: s3
    location: s3://acme-orders-prod/orders/
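The `delimiter` field of the S3 server distinguishes two ways of packing several JSON documents into one file. The Python sketch below shows one plausible reading of the two values: `new_line` as one JSON document per line, and `array` as a single top-level JSON array. That interpretation and the function name `read_json_documents` are assumptions made for illustration; the specification text itself only names the two values.

```python
import json


def read_json_documents(text: str, delimiter: str) -> list:
    """Split the content of one file into JSON documents (hypothetical helper)."""
    if delimiter == "new_line":
        # Assumption: one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if delimiter == "array":
        # Assumption: the whole file is a single JSON array of documents.
        documents = json.loads(text)
        if not isinstance(documents, list):
            raise ValueError("expected a top-level JSON array")
        return documents
    raise ValueError(f"unsupported delimiter: {delimiter!r}")


newline_delimited = '{"order_id": "1001"}\n{"order_id": "1002"}\n'
as_array = '[{"order_id": "1001"}, {"order_id": "1002"}]'
print(read_json_documents(newline_delimited, "new_line"))
print(read_json_documents(as_array, "array"))
```

Both calls yield the same two documents, which is the point of the field: the logical records are identical, only the file layout differs.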
Field | Type | Description |
---|---|---|
type | string | glue |
account | string | REQUIRED. The AWS account, e.g., 1234-5678-9012 |
database | string | REQUIRED. The AWS Glue Catalog database |
location | string | URI location of the Glue Database |
format | string | Format of files, such as parquet, delta, json, csv |
Example:
servers:
  production:
    type: glue
    account: "1234-5678-9012"
    database: acme-orders
    location: s3://acme-orders-prod/orders/
    format: parquet
Field | Type | Description |
---|---|---|
type | string | redshift |
account | string | |
database | string | |
schema | string | |
Field | Type | Description |
---|---|---|
type | string | azure |
location | string | Fully qualified path to Azure Blob Storage or Azure Data Lake Storage (ADLS), supports globs. Starting with az:// or abfss://. Examples: az://my_storage_account_name.blob.core.windows.net/my_container/path/*.parquet or abfss://my_storage_account_name.dfs.core.windows.net/my_container_name/path/*.parquet |
format | string | Format of files, such as parquet, json, csv |
delimiter | string | (Only for format = json), how multiple json documents are delimited within one file, e.g., new_line, array |
Field | Type | Description |
---|---|---|
type | string | sqlserver |
host | string | The host to the database server |
port | integer | The port to the database server, default: 1433 |
database | string | The name of the database, e.g., database. |
schema | string | The name of the schema in the database, e.g., dbo. |
driver | string | The name of the supported driver, e.g., ODBC Driver 18 for SQL Server. |
Field | Type | Description |
---|---|---|
type | string | snowflake |
account | string | |
database | string | |
schema | string | |
Field | Type | Description |
---|---|---|
type | string | databricks |
host | string | The Databricks host, e.g., dbc-abcdefgh-1234.cloud.databricks.com |
catalog | string | The name of the Hive or Unity catalog |
schema | string | The schema name in the catalog |
Field | Type | Description |
---|---|---|
type | string | postgres |
host | string | The host to the database server |
port | integer | The port to the database server |
database | string | The name of the database, e.g., postgres. |
schema | string | The name of the schema in the database, e.g., public. |
Field | Type | Description |
---|---|---|
type | string | oracle |
host | string | The host to the oracle server |
port | integer | The port to the oracle server |
serviceName | string | The name of the service |
Field | Type | Description |
---|---|---|
type | string | kafka |
host | string | The bootstrap server of the kafka cluster. |
topic | string | The topic name. |
format | string | The format of the message. Examples: json, avro, protobuf. Default: json. |
Field | Type | Description |
---|---|---|
type | string | pubsub |
project | string | The GCP project name. |
topic | string | The topic name. |
Field | Type | Description |
---|---|---|
type | string | sftp |
location | string | SFTP URL, starting with sftp:// |
format | string | Format of files, such as parquet, delta, json, csv |
delimiter | string | (Only for format = json), how multiple json documents are delimited within one file, e.g., new_line, array |
Field | Type | Description |
---|---|---|
type | string | kinesis |
stream | string | The name of the Kinesis data stream. |
region | string | AWS region, e.g., eu-west-1. |
format | string | The format of the records. Examples: json, avro, protobuf. |
Field | Type | Description |
---|---|---|
type | string | trino |
host | string | The Trino host |
port | integer | The Trino port |
catalog | string | The name of the catalog, e.g., my_catalog. |
schema | string | The name of the schema in the catalog, e.g., my_schema. |
Field | Type | Description |
---|---|---|
type | string | local |
path | string | The relative or absolute path to the data file(s), such as ./folder/data.parquet. |
format | string | The format of the file(s), such as parquet, delta, csv, or json. |
The terms and conditions of the data contract.
Field | Type | Description |
---|---|---|
usage | string | The usage describes the way the data is expected to be used. Can contain business and technical information. |
limitations | string | The limitations describe the restrictions on how the data can be used, can be technical or restrictions on what the data may not be used for. |
billing | string | The billing describes the pricing model for using the data, such as whether it's free, having a monthly fee, or metered pay-per-use. |
noticePeriod | string | The period of time that must be given by either party to terminate or modify a data usage agreement. Uses ISO-8601 period format, e.g., P3M for a period of three months. |
This object MAY be extended with Specification Extensions.
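To make the `noticePeriod` format concrete, here is a small Python sketch that parses date-based ISO 8601 periods such as `P3M` or `P1Y` with a regular expression. The function name `parse_period` is hypothetical, and the sketch deliberately ignores time components (anything after `T`) as a simplifying assumption.

```python
import re

# Date-only ISO 8601 periods such as P3M, P1Y, or P1Y6M.
# Assumption: time components (the part after "T") are not needed here.
_PERIOD = re.compile(r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?$")


def parse_period(value: str) -> dict:
    """Parse a date-based ISO 8601 period into its components (hypothetical helper)."""
    match = _PERIOD.match(value)
    if match is None or value == "P":
        raise ValueError(f"not a supported ISO 8601 period: {value!r}")
    years, months, weeks, days = (int(group) if group else 0 for group in match.groups())
    return {"years": years, "months": months, "weeks": weeks, "days": days}


print(parse_period("P3M"))
print(parse_period("P1Y"))
```

Keeping the components separate, rather than converting to days, avoids guessing how long a month or a year is.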
The Model Object describes the structure and semantics of a data model, such as tables, views, or structured files.
The name of the data model (table name) is defined by the key that refers to this Model Object.
Field | Type | Description |
---|---|---|
type | string | The type of the model. Examples: table, view, object. Default: table. |
description | string | An optional string describing the data model. |
title | string | An optional string for the title of the data model. Especially useful if the name of the model is cryptic or contains abbreviations. |
fields | Map[string, Field Object] | The fields (e.g. columns) of the data model. |
config | Config Object | Any additional key-value pairs that might be useful for further tooling. |
This object MAY be extended with Specification Extensions.
The Field Object describes one field (column, property, nested field) of a data model.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the semantics of the data in this field. |
type | Data Type | The logical data type of the field. |
title | string | An optional string providing a human-readable name for the field. Especially useful if the field name is cryptic or contains abbreviations. |
enum | array of string | A value must be equal to one of the elements in this array. Only evaluated if the value is not null. |
required | boolean | Indicates whether this field must contain a value and may not be null. Default: false |
primary | boolean | Indicates whether this field is a primary key. Default: false |
references | string | The reference to a field in another model, e.g., use 'orders.order_id' to reference the order_id field of the model orders. Think of defining a foreign key relationship. |
unique | boolean | Indicates whether the value must be unique within the model. Default: false |
format | string | email: A value must be compliant with RFC 5321, section 4.1.2. uri: A value must be compliant with RFC 3986. uuid: A value must be compliant with RFC 4122. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
precision | number | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. |
scale | number | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. |
minLength | number | The length of a value must be greater than, or equal to, this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
maxLength | number | The length of a value must be less than, or equal to, this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
pattern | string | A value must match this regular expression, written in the ECMA-262 dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
minimum | number | A numeric value must be greater than, or equal to, this value. Only evaluated if the value is not null. Only applies to numeric values. |
exclusiveMinimum | number | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. |
maximum | number | A numeric value must be less than, or equal to, this value. Only evaluated if the value is not null. Only applies to numeric values. |
exclusiveMaximum | number | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. |
example | string | An example value. |
pii | boolean | Indicates whether this field contains Personally Identifiable Information (PII). |
classification | string | The data class defining the sensitivity level for this field, according to the organization's classification scheme. Examples may be: sensitive, restricted, internal, public. |
tags | Array of string | Custom metadata to provide additional context. |
links | Map[string, string] | Additional external documentation links. |
$ref | string | A reference URI to a definition in the specification, internally or externally. Properties will be inherited from the definition. |
fields | Map[string, Field Object] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is object, record, or struct. |
items | Field Object | The type of the elements in the array. Use only when type is array. |
keys | Field Object | Describes the key structure of a map. Defaults to type: string if a map is defined as type. Not all server types support different key types. Use only when type is map. |
values | Field Object | Describes the value structure of a map. Use only when type is map. |
config | Config Object | Any additional key-value pairs that might be useful for further tooling. |
This object MAY be extended with Specification Extensions.
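The constraint fields of the Field Object can be read as simple predicates on a single value. The Python sketch below evaluates a subset of them (`required`, `enum`, `minLength`, `maxLength`, `pattern`, and the four numeric bounds). The function name `check_value` is hypothetical, and two points are assumptions: Python's `re` module stands in for the ECMA-262 dialect, and `pattern` is treated as matching anywhere in the value unless it is anchored.

```python
import re


def check_value(value, field: dict) -> list[str]:
    """Return the names of the constraints that the value violates (hypothetical helper)."""
    if value is None:
        # All constraints except `required` are only evaluated for non-null values.
        return ["required"] if field.get("required", False) else []
    errors = []
    if "enum" in field and value not in field["enum"]:
        errors.append("enum")
    if isinstance(value, str):
        if "minLength" in field and len(value) < field["minLength"]:
            errors.append("minLength")
        if "maxLength" in field and len(value) > field["maxLength"]:
            errors.append("maxLength")
        # Assumption: Python's `re` approximates ECMA-262 for simple patterns.
        if "pattern" in field and re.search(field["pattern"], value) is None:
            errors.append("pattern")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in field and value < field["minimum"]:
            errors.append("minimum")
        if "exclusiveMinimum" in field and value <= field["exclusiveMinimum"]:
            errors.append("exclusiveMinimum")
        if "maximum" in field and value > field["maximum"]:
            errors.append("maximum")
        if "exclusiveMaximum" in field and value >= field["exclusiveMaximum"]:
            errors.append("exclusiveMaximum")
    return errors


customer_id = {"type": "text", "minLength": 10, "maxLength": 20}
sku = {"type": "text", "pattern": "^[A-Za-z0-9]{8,14}$"}
print(check_value("1000000001", customer_id))
print(check_value("123", customer_id))
print(check_value("96385074", sku))
```

Returning the list of violated constraint names, instead of a boolean, makes it easier to report which part of the contract a record breaks.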
The Definition Object includes clear and concise explanations of the syntax, semantics, and classification of a business object in a given domain.
It serves as a reference for a common understanding of terminology, to ensure consistent usage, and to identify joinable fields.
Model fields can refer to definitions using the $ref field to link to existing definitions and avoid duplicate documentation.
Field | Type | Description |
---|---|---|
name | string | REQUIRED. The technical name of this definition. |
type | Data Type | REQUIRED. The logical data type. |
domain | string | The domain in which this definition is valid. Default: global. |
title | string | The business name of this definition. |
description | string | Clear and concise explanations related to the domain. |
enum | array of string | A value must be equal to one of the elements in this array. Only evaluated if the value is not null. |
format | string | email: A value must be compliant with RFC 5321, section 4.1.2. uri: A value must be compliant with RFC 3986. uuid: A value must be compliant with RFC 4122. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
precision | number | The maximum number of digits in a number. Only applies to numeric values. Defaults to 38. |
scale | number | The maximum number of decimal places in a number. Only applies to numeric values. Defaults to 0. |
minLength | number | The length of a value must be greater than, or equal to, this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
maxLength | number | The length of a value must be less than, or equal to, this value. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
pattern | string | A value must match this regular expression, written in the ECMA-262 dialect. Only evaluated if the value is not null. Only applies to Unicode character sequence types (string, text, varchar). |
minimum | number | A numeric value must be greater than, or equal to, this value. Only evaluated if the value is not null. Only applies to numeric values. |
exclusiveMinimum | number | A numeric value must be greater than this value. Only evaluated if the value is not null. Only applies to numeric values. |
maximum | number | A numeric value must be less than, or equal to, this value. Only evaluated if the value is not null. Only applies to numeric values. |
exclusiveMaximum | number | A numeric value must be less than this value. Only evaluated if the value is not null. Only applies to numeric values. |
example | string | An example value. |
pii | boolean | Indicates whether this field contains Personally Identifiable Information (PII). |
classification | string | The data class defining the sensitivity level for this field, according to the organization's classification scheme. |
tags | Array of string | Custom metadata to provide additional context. |
links | Map[string, string] | Additional external documentation links. |
fields | Map[string, Field Object] | The nested fields (e.g. columns) of the object, record, or struct. Use only when type is object, record, or struct. |
items | Field Object | The type of the elements in the array. Use only when type is array. |
keys | Field Object | Describes the key structure of a map. Defaults to type: string if a map is defined as type. Not all server types support different key types. Use only when type is map. |
values | Field Object | Describes the value structure of a map. Use only when type is map. |
This object MAY be extended with Specification Extensions.
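How a field inherits properties from a definition via `$ref` can be sketched in a few lines of Python. The function name `resolve_field` is hypothetical. The sketch handles only internal references of the form `#/definitions/<name>`, and it assumes that properties set directly on the field take precedence over inherited ones; the specification text only states that properties are inherited.

```python
def resolve_field(field: dict, contract: dict) -> dict:
    """Merge a field with the definition its `$ref` points to (hypothetical helper)."""
    ref = field.get("$ref")
    if ref is None:
        return dict(field)
    prefix = "#/definitions/"
    if not ref.startswith(prefix):
        # External references (files, URLs) are out of scope for this sketch.
        raise ValueError(f"unsupported reference: {ref!r}")
    definition = contract["definitions"][ref[len(prefix):]]
    # Assumption: properties set on the field itself win over inherited ones.
    local = {key: value for key, value in field.items() if key != "$ref"}
    return {**definition, **local}


contract = {
    "definitions": {
        "order_id": {"name": "order_id", "type": "text", "format": "uuid"},
    }
}
field = {"$ref": "#/definitions/order_id", "required": True, "primary": True}
print(resolve_field(field, contract))
```

The resolved field carries `type` and `format` from the definition together with its own `required` and `primary` flags.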
The schema of the data contract describes the physical schema. The type of the schema depends on the data platform.
Field | Type | Description |
---|---|---|
type | string | REQUIRED. The type of the schema. Typical values are: dbt, bigquery, json-schema, sql-ddl, avro, protobuf, custom |
specification | dbt Schema Object, BigQuery Schema Object, JSON Schema Schema Object, SQL DDL Schema Object, or string | REQUIRED. The specification of the schema. The schema specification can be encoded as a string or as inline YAML. |
https://docs.getdbt.com/reference/model-properties
Example (inline YAML):
schema:
  type: dbt
  specification:
    version: 2
    models:
      - name: "My Table"
        description: "My description"
        columns:
          - name: "My column"
            data_type: text
            description: "My description"
Example (string):
schema:
  type: dbt
  specification: |-
    version: 2
    models:
      - name: "My Table"
        description: "My description"
        columns:
          - name: "My column"
            data_type: text
            description: "My description"
The schema structure is defined by the Google BigQuery Table object. You can extract such a Table object via the tables.get endpoint.
Instead of providing a single Table object, you can also provide an array of such objects. Be aware that tables.list only returns a subset of the full Table object. You need to call every Table object via tables.get to get the full Table object, including the actual schema.
Learn more: Google BigQuery REST Reference v2
Example:
schema:
  type: bigquery
  specification: |-
    {
      "tableReference": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "my_table"
      },
      "description": "This is a description",
      "type": "TABLE",
      "schema": {
        "fields": [
          {
            "name": "name",
            "type": "STRING",
            "mode": "NULLABLE",
            "description": "This is a description"
          }
        ]
      }
    }
JSON Schema can be defined as JSON or rendered as YAML, following the OpenAPI Schema Object dialect
Example (inline YAML):
schema:
  type: json-schema
  specification:
    orders:
      description: One record per order. Includes cancelled and deleted orders.
      type: object
      properties:
        order_id:
          type: string
          description: Primary key of the orders table
        order_timestamp:
          type: string
          format: date-time
          description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        order_total:
          type: integer
          description: Total amount of the order in the smallest monetary unit (e.g., cents).
    line_items:
      type: object
      properties:
        lines_item_id:
          type: string
          description: Primary key of the line_items table
        order_id:
          type: string
          description: Foreign key to the orders table
        sku:
          type: string
          description: The purchased article number
Example (string):
schema:
  type: json-schema
  specification: |-
    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties": {
        "orders": {
          "type": "object",
          "description": "One record per order. Includes cancelled and deleted orders.",
          "properties": {
            "order_id": {
              "type": "string",
              "description": "Primary key of the orders table"
            },
            "order_timestamp": {
              "type": "string",
              "format": "date-time",
              "description": "The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful."
            },
            "order_total": {
              "type": "integer",
              "description": "Total amount of the order in the smallest monetary unit (e.g., cents)."
            }
          },
          "required": ["order_id", "order_timestamp", "order_total"]
        },
        "line_items": {
          "type": "object",
          "properties": {
            "lines_item_id": {
              "type": "string",
              "description": "Primary key of the line_items table"
            },
            "order_id": {
              "type": "string",
              "description": "Foreign key to the orders table"
            },
            "sku": {
              "type": "string",
              "description": "The purchased article number"
            }
          },
          "required": ["lines_item_id", "order_id", "sku"]
        }
      },
      "required": ["orders", "line_items"]
    }
Classical SQL DDLs can be used to describe the structure.
Example (string):
schema:
  type: sql-ddl
  specification: |-
    -- One record per order. Includes cancelled and deleted orders.
    CREATE TABLE orders (
      order_id TEXT PRIMARY KEY, -- Primary key of the orders table
      order_timestamp TIMESTAMPTZ NOT NULL, -- The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
      order_total INTEGER NOT NULL -- Total amount of the order in the smallest monetary unit (e.g., cents)
    );
    -- The items that are part of an order
    CREATE TABLE line_items (
      lines_item_id TEXT PRIMARY KEY, -- Primary key of the line_items table
      order_id TEXT REFERENCES orders(order_id), -- Foreign key to the orders table
      sku TEXT NOT NULL -- The purchased article number
    );
Field | Type | Description |
---|---|---|
type | string | The type of the example data. Well-known types are: csv, json, yaml, custom |
description | string | An optional string describing the example. |
model | string | The reference to the model in the schema, e.g. a table name. |
data | string | Example data for this model. |
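An Example Object of type csv carries its rows as a plain string, so tooling has to parse it before use. The Python sketch below does this with the standard `csv` module for an already parsed Example Object; the function name `example_rows` is hypothetical, and the two data rows are taken from the orders example of this specification.

```python
import csv
import io

# An Example Object as it would look after parsing the contract YAML.
example = {
    "type": "csv",
    "model": "orders",
    "data": (
        "order_id,order_timestamp,order_total\n"
        '"1001","2023-09-09T08:30:00Z",2500\n'
        '"1002","2023-09-08T15:45:00Z",1800\n'
    ),
}


def example_rows(example: dict) -> list[dict]:
    """Turn the `data` string of a csv example into row dictionaries (hypothetical helper)."""
    if example["type"] != "csv":
        raise ValueError("this sketch only handles csv examples")
    return list(csv.DictReader(io.StringIO(example["data"])))


rows = example_rows(example)
print(rows[0])
```

Note that the `csv` module returns every value as a string; converting `order_total` to a number according to the model's field type is left to the caller.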
Example:
examples:
  - type: csv
    model: orders
    data: |-
      order_id,order_timestamp,order_total
      "1001","2023-09-09T08:30:00Z",2500
      "1002","2023-09-08T15:45:00Z",1800
      "1003","2023-09-07T12:15:00Z",3200
      "1004","2023-09-06T19:20:00Z",1500
      "1005","2023-09-05T10:10:00Z",4200
      "1006","2023-09-04T14:55:00Z",2800
      "1007","2023-09-03T21:05:00Z",1900
      "1008","2023-09-02T17:40:00Z",3600
      "1009","2023-09-01T09:25:00Z",3100
      "1010","2023-08-31T22:50:00Z",2700
A service level is defined as an agreed-upon, measurable level of performance for the provided data. The Data Contract Specification defines well-known service levels. This list can be extended with custom service levels.
One can either describe each service level informally using the description field, or make use of the predefined fields for automation support, e.g., via the Data Contract CLI.
Field | Type | Description |
---|---|---|
availability | Availability Object | The promised uptime of the system that provides the data |
retention | Retention Object | The period for which data will be available. |
latency | Latency Object | The maximum amount of time data takes to travel from the source to its destination. |
freshness | Freshness Object | The maximum age of the youngest entry. |
frequency | Frequency Object | The update frequency. |
support | Support Object | The times when support is provided. |
backup | Backup Object | The details about data backup procedures. |
This object MAY be extended with Specification Extensions.
Availability refers to the promise or guarantee by the service provider about the uptime of the system that provides the data.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the availability service level. |
percentage | string | An optional string describing the guaranteed uptime in percent (e.g., 99.9%). |
This object MAY be extended with Specification Extensions.
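As a minimal sketch, an availability service level could be declared as follows. The top-level `servicelevels` key is an assumption (this section does not show where the object is nested), and the values are illustrative.

```yaml
# Sketch only: the `servicelevels` key is assumed, values are illustrative.
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
```

Because `percentage` is typed as a string, the percent sign is part of the value.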
Retention covers the period for which data will be available.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the retention service level. |
period | string | An optional period of time for which data is available. Supported formats: simple duration (e.g., 1 year, 30d) and ISO 8601 duration (e.g., P1Y). |
unlimited | boolean | An optional indicator that data is kept forever. |
timestampField | string | An optional reference to the field that contains the timestamp that the period refers to. |
This object MAY be extended with Specification Extensions.
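A retention service level might look like the sketch below. It assumes that service levels sit under a top-level `servicelevels` key and that fields are referenced as `model.field`; neither is confirmed in this section.

```yaml
# Sketch only: nesting under `servicelevels` and the `model.field`
# reference notation are assumptions.
servicelevels:
  retention:
    description: Data is retained for one year
    period: P1Y          # ISO 8601 duration; "1 year" is the simple form
    unlimited: false
    timestampField: orders.order_timestamp
```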
Latency refers to the maximum amount of time data takes to travel from the source to its destination.
An example is the maximum duration from the moment an order is recorded in the e-commerce shop until it is available in the orders table in the data analytics platform. This includes the waiting time until the next batch run starts and the processing time of the pipeline.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the latency service level. |
threshold | string | An optional maximum duration between the source timestamp and the processed timestamp. Supported formats: simple duration (e.g., 24 hours, 5s) and ISO 8601 duration (e.g., PT24H). |
sourceTimestampField | string | An optional reference to the field that contains the timestamp when the data was provided at the source. |
processedTimestampField | string | An optional reference to the field that contains the processing timestamp, which denotes when the data is made available to consumers of this data contract. |
This object MAY be extended with Specification Extensions.
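To make the two timestamp fields concrete: for each record, the difference between the processed timestamp and the source timestamp should not exceed the threshold. The sketch below assumes a top-level `servicelevels` key and a `model.field` reference notation; `processed_timestamp` is a hypothetical field that does not appear in the examples of this document.

```yaml
# Sketch only: `servicelevels`, the `model.field` notation, and the
# `processed_timestamp` field are assumptions.
servicelevels:
  latency:
    description: Data is available within 25 hours after the order was placed
    threshold: PT25H
    sourceTimestampField: orders.order_timestamp
    processedTimestampField: orders.processed_timestamp
```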
Freshness refers to the maximum age of the youngest entry.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the freshness service level. |
threshold | string | An optional maximum age of the youngest entry. Supported formats: simple duration (e.g., 24 hours, 5s) and ISO 8601 duration (e.g., PT24H). |
timestampField | string | An optional reference to the field that contains the timestamp that the threshold refers to. |
This object MAY be extended with Specification Extensions.
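Unlike latency, freshness looks at a single timestamp: the most recent value of the referenced field is compared with the current time. A hedged sketch, assuming a top-level `servicelevels` key and a `model.field` reference notation:

```yaml
# Sketch only: `servicelevels` and the `model.field` notation are assumptions.
servicelevels:
  freshness:
    description: The youngest order in the table is at most 25 hours old
    threshold: PT25H
    timestampField: orders.order_timestamp
```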
Frequency describes how often data is updated.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the frequency service level. |
type | string | An optional type of data processing. Typical values are batch, micro-batching, streaming, manual. |
interval | string | Optional. Only for batch: How often the pipeline is triggered, e.g., daily. |
cron | string | Optional. Only for batch: A cron expression that defines when the pipeline is triggered, e.g., 0 0 * * *. |
This object MAY be extended with Specification Extensions.
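For a daily batch pipeline, the frequency could be expressed roughly as below. The `servicelevels` parent key is an assumption. Both `interval` and `cron` are shown to illustrate the two fields; whether a contract should set one or both is not stated here.

```yaml
# Sketch only: the `servicelevels` key is assumed.
servicelevels:
  frequency:
    description: Data is delivered once a day
    type: batch
    interval: daily
    cron: "0 0 * * *"    # every day at midnight; quoted to keep it a plain string
```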
Support describes the times when support will be available for contact.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the support service level. |
time | string | An optional string describing the times when support will be available for contact, such as 24/7 or business hours only. |
responseTime | string | An optional string describing the time it takes for the support team to acknowledge a request. This does not mean the issue will be resolved immediately, but it assures users that their request has been received and will be dealt with. |
This object MAY be extended with Specification Extensions.
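A support service level could be sketched as follows, again assuming a top-level `servicelevels` key. Both fields are free-form strings, so the values are only illustrative.

```yaml
# Sketch only: the `servicelevels` key is assumed, values are illustrative.
servicelevels:
  support:
    description: Support is provided during typical business hours
    time: business hours only
    responseTime: 1h     # time to acknowledge a request, not to resolve it
```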
Backup specifies details about data backup procedures.
Field | Type | Description |
---|---|---|
description | string | An optional string describing the backup service level. |
interval | string | An optional interval that defines how often data will be backed up, e.g., daily. |
cron | string | An optional cron expression that defines when data will be backed up, e.g., 0 0 * * *. |
recoveryTime | string | An optional Recovery Time Objective (RTO) that specifies the maximum amount of time allowed to restore data from a backup after a failure or loss event (e.g., 4 hours, 24 hours). |
recoveryPoint | string | An optional Recovery Point Objective (RPO) that defines the maximum acceptable age of files that must be recovered from backup storage for normal operations to resume after a disaster or data loss event. This essentially measures how much data you can afford to lose, measured in time (e.g., 4 hours, 24 hours). |
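The backup fields might be combined as in this sketch, which assumes a top-level `servicelevels` key. With a weekly backup, up to one week of data can be lost, which is why the recovery point is set to one week here.

```yaml
# Sketch only: the `servicelevels` key is assumed, values are illustrative.
servicelevels:
  backup:
    description: Data is backed up once a week, every Sunday at 0:00 UTC
    interval: weekly
    cron: "0 0 * * 0"        # Sunday at midnight
    recoveryTime: 24 hours   # RTO
    recoveryPoint: 1 week    # RPO
```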
The quality object contains quality attributes and checks.
Field | Type | Description |
---|---|---|
type | string | REQUIRED. The type of the quality specification. Typical values are: SodaCL, montecarlo, great-expectations, custom. |
specification | SodaCL Quality Object, Monte Carlo Schema Object, Great Expectations Quality Object, or string | REQUIRED. The specification of the quality attributes. The quality specification can be encoded as a string or as inline YAML. |
Quality attributes in Soda Checks Language.
The specification represents the content of a checks.yml file.
Example (inline):
quality:
  type: SodaCL # data quality check format: SodaCL, montecarlo, dbt-tests, custom
  specification: # expressed as string or inline yaml or via "$ref: checks.yaml"
    checks for orders:
      - row_count > 0
      - duplicate_count(order_id) = 0
    checks for line_items:
      - row_count > 0
Example (string):
quality:
  type: SodaCL
  specification: |-
    checks for search_queries:
      - freshness(search_timestamp) < 1d
      - row_count > 100000
      - missing_count(search_query) = 0
Quality attributes defined as Monte Carlo Monitors as Code.
The specification represents the content of a montecarlo.yml file.
Example (string):
quality:
  type: montecarlo
  specification: |-
    montecarlo:
      field_health:
        - table: project:dataset.table_name
          timestamp_field: created
      dimension_tracking:
        - table: project:dataset.table_name
          timestamp_field: created
          field: order_status
Quality attributes defined as Great Expectations Expectations.
The specification represents a list of expectations on a specific model.
Example (string):
quality:
  type: great-expectations
  specification:
    orders: |-
      [
        {
          "expectation_type": "expect_table_row_count_to_be_between",
          "kwargs": {
            "min_value": 10
          },
          "meta": {
          }
        }
      ]
The config field can be used to set additional metadata that may be used by tools, e.g. to define a namespace for code generation, specify physical data types, toggle tests, etc.
A config field can be added with any name. The value can be null, a primitive, an array or an object.
For developer experience, a list of well-known field names is maintained here, as these fields are used in the Data Contract CLI:
Field | Type | Description |
---|---|---|
avroNamespace | string | (Only on model level) The namespace to use when importing and exporting the data model from / to Apache Avro. |
avroType | string | (Only on field level) Specify the field type to use when exporting the data model to Apache Avro. |
avroLogicalType | string | (Only on field level) Specify the logical field type to use when exporting the data model to Apache Avro. |
bigqueryType | string | (Only on field level) Specify the physical column type that is used in a BigQuery table, e.g., NUMERIC(5, 2). |
snowflakeType | string | (Only on field level) Specify the physical column type that is used in a Snowflake table, e.g., TIMESTAMP_LTZ. |
redshiftType | string | (Only on field level) Specify the physical column type that is used in a Redshift table, e.g., SMALLINT. |
sqlserverType | string | (Only on field level) Specify the physical column type that is used in a SQL Server table, e.g., DATETIME2. |
databricksType | string | (Only on field level) Specify the physical column type that is used in a Databricks table. |
glueType | string | (Only on field level) Specify the physical column type that is used in an AWS Glue Data Catalog table. |
This object MAY be extended with Specification Extensions.
Example:
models:
  orders:
    config:
      avroNamespace: "my.namespace"
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (millisecond precision)
        type: timestamp
        config:
          avroType: long
          avroLogicalType: timestamp-millis
          snowflakeType: timestamp_tz
The following data types are supported for model fields and definitions:
- Unicode character sequence: string, text, varchar
- Any numeric type, either integers or floating point numbers: number, decimal, numeric
- 32-bit signed integer: int, integer
- 64-bit signed integer: long, bigint
- Single precision (32-bit) IEEE 754 floating-point number: float
- Double precision (64-bit) IEEE 754 floating-point number: double
- Boolean value (true or false): boolean
- Timestamp with timezone: timestamp, timestamp_tz
- Timestamp with no timezone: timestamp_ntz
- Date with no time information: date
- Array: array
- Map: map (may not be supported by some server types)
- Sequence of 8-bit unsigned bytes: bytes
- Complex type: object, record, struct
- No value: null
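As a brief illustration, the sketch below assigns several of these types to the fields of a model, using the `models` / `fields` / `type` nesting from the config example. The fields `discount_rate`, `is_gift`, and `delivery_date` are hypothetical and only serve to show the type names.

```yaml
# Sketch only: discount_rate, is_gift, and delivery_date are hypothetical fields.
models:
  orders:
    fields:
      order_id:
        type: string
      order_total:
        type: integer      # 32-bit signed integer; use long for 64-bit
      discount_rate:
        type: decimal
      is_gift:
        type: boolean
      order_timestamp:
        type: timestamp    # timestamp with timezone
      delivery_date:
        type: date
```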
While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.
A custom field can be added with any name. The value can be null, a primitive, an array or an object.
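The following sketch shows custom fields on an object that allows Specification Extensions, here the availability object. The `servicelevels` parent key is an assumption, and the custom field names (`measuredBy`, `excludedWindows`, `escalation`) are invented for illustration; they are not part of the specification.

```yaml
# Sketch only: `servicelevels` is assumed; all custom field names are hypothetical.
servicelevels:
  availability:
    description: The server is available during support hours
    percentage: 99.9%
    measuredBy: synthetic-monitoring    # primitive value
    excludedWindows:                    # array value
      - Sunday 02:00-04:00 UTC
    escalation:                         # object value
      channel: "#data-platform"
      pager: null                       # null value
```

Tools that do not know a custom field are expected to ignore it, so such fields are best used for metadata that is optional for consumers.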
- Data Contract CLI is an open-source CLI tool to help you create, develop, and maintain your data contracts.
- Data Contract Manager is a commercial tool to manage data contracts. It includes a data contract catalog, a Web-Editor, and a request and approval workflow to automate access to data products for a full enterprise data marketplace.
- Data Contract GPT is a custom GPT that can help you write data contracts.
- Data Contract Editor is an open-source editor for Data Contracts, including a live HTML preview.
The JSON Schema of the current data contract specification is registered in Schema Store, which brings code completion and syntax checks for all major IDEs. IntelliJ comes with a built-in YAML plugin which will show you autocompletions. For VS Code, we recommend installing the YAML plugin. No additional configuration is required.
Autocompletion is then enabled for files following these patterns:
datacontract.yaml
datacontract.yml
*-datacontract.yaml
*-datacontract.yml
*.datacontract.yaml
*.datacontract.yml
datacontract-*.yaml
datacontract-*.yml
**/datacontract/*.yml
**/datacontract/*.yaml
**/datacontracts/*.yml
**/datacontracts/*.yaml
The Data Contract Specification was originally created by Jochen Christ and Dr. Simon Harrer, and is currently maintained by them.
Contributions are welcome! Please open an issue or a pull request.